Well, I'm just saying you'll never be able to build something with Crysis-like levels of detail when you go for an outdated scheme like this.
Modern hardware can easily handle millions of polys per frame. There's just no way your CPU can process them. So you have to stay out of the way of the GPU.

This algorithm by Alan Baylis that you mention, would that be this one from 2002?
http://www.alsprogrammingresource.com/portals_tutorial.html

Clearly an algorithm aimed at 2002 polycount and GPU capabilities is no longer up to date for 2009.
Posted on 2009-06-25 10:18:43 by Scali
That's an unfair appraisal; I'm not sure what you base that on.
Alan's technique was never used in mainstream products.
If I may draw an analogy, the algorithm for BSP was published DECADES before Carmack wrote Doom.
By that reasoning, Doom should never have seen the light of day!
I'll be reserving my gpu muscle for eyecandy, concentrating on realtime radiosity.

I do agree with the spirit of your comment, as a generalization.
Posted on 2009-06-25 17:46:17 by Homer
I base it on the fact that game developers have been struggling with BSP and portal approaches ever since the launch of early T&L hardware. The Quake engine actually got criticism because the BSP approach severely limited polygon throughput. Later versions of the Quake/Doom engines used 'leafy' approaches, which worked with larger groups of polygons instead.
Other developers dropped BSP altogether and went for other approaches, such as octrees.

I'm not saying old algorithms are necessarily bad. But in this case the hardware caught up with BSP and made it obsolete in its original form.

The idea of portal rendering in itself is fine. You construct a viewing volume and test which objects fall into the viewing volume. If one or more of those objects is a portal, recurse into it as well.
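
In rough C++ terms the recursion looks something like this (hypothetical Cell/Portal/Frustum types, purely to show the shape of the algorithm - not code from any engine discussed here):

#include <vector>

struct Frustum { /* clipping planes, starts as the camera's view volume */ };
struct Object  { /* bounding volume plus whatever is needed to draw it */ };
struct Cell;

struct Portal {
    Cell*   target;                                   // the cell on the other side
    bool    IntersectsFrustum(const Frustum& f) const;
    Frustum ClipFrustum(const Frustum& f) const;      // narrow the frustum to the portal opening
};

struct Cell {
    std::vector<Object*> objects;
    std::vector<Portal*> portals;
};

bool ObjectVisible(const Object& obj, const Frustum& f);   // bounding-volume vs frustum test
void DrawObject(const Object& obj);

void RenderCell(const Cell& cell, const Frustum& frustum)
{
    for (Object* obj : cell.objects)
        if (ObjectVisible(*obj, frustum))
            DrawObject(*obj);

    // any portal inside the current frustum exposes another potentially visible cell;
    // the frustum shrinks at every step (and back-facing portals get rejected),
    // which is what keeps the recursion bounded
    for (Portal* portal : cell.portals)
        if (portal->IntersectsFrustum(frustum))
            RenderCell(*portal->target, portal->ClipFrustum(frustum));
}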

The thing is just, if you take a game like Crysis... The average frame has over 3 million polys. Now, let's say you have a 3 GHz processor... This means that, best case, you have about 1000 cycles to spend per polygon, if you are going to process things on a per-polygon basis. That's not a whole lot. Clearly there will be lots of other overhead in the game engine as well, so the actual amount of processing time will be far lower. It simply won't be enough to do anything on a per-poly basis. You don't have the CPU cycles, period. A GPU does, however, so let the GPU figure things out on a per-poly level.
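
A quick back-of-the-envelope, assuming a 60 fps target:

3,000,000,000 cycles/s / 3,000,000 polys = 1000 cycles per poly - and that's only if one frame is allowed to take an entire second;
3,000,000,000 / (3,000,000 x 60) = roughly 16 cycles per poly at 60 fps, before any other engine work.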

I'd like to direct you to the legendary Batch Batch Batch presentation by nVidia: http://developer.nvidia.com/docs/IO/8230/BatchBatchBatch.pdf
Posted on 2009-06-26 03:18:12 by Scali
That's a good read, although I've read it before.
If you walk back over my comments in the past few posts, you'll see that I have done exactly what they suggest - few, large batches rather than many, small ones.
I don't have access to geometry shaders under D3D9/XP (although they're available to opengl, which is annoying).. and I don't see myself jumping onto any other Windows OS for the time being.
So I'm stuck with throwing triangles at the videocard manually, which is why I'm using a partitioning scheme to help reduce the number of polygons I am throwing around (alleviate bus congestion bottleneck).
Whether I use an octree, a kd-tree, or any other kind of partitioning scheme is not relevant - I'm NOT using the classic BSP rendering techniques (which require a tree walk, even though the polygons only live in the leaf nodes). I'm only taking advantage of the fact that BSP leaf nodes are convex subspaces, so there is guaranteed zero occlusion within the convex cluster of faces containing the camera, and I'm using the Portal stuff to extend that premise further into Z space - minimal overdraw, WITHOUT the drawbacks associated with the classic BSP rendering techniques (the front-to-back or back-to-front walk).

Believe me, if I could send a skeletal representation of my mesh to the GPU and generate my polygons from there, of COURSE I could reliably produce millions of triangles per second. I'd love to, and I'm not a bad shader programmer, I just don't have access to DX10 and above at this time. Besides, I want my stuff to run on 'anything out there'; I expect my software to scale to the hardware, not the other way around... which I believe is something you're looking to try to squeeze out of your DX10 engine?

Posted on 2009-06-27 23:27:54 by Homer
My next step will be to create storage lists of entities in each subspace - clearly, each subspace owns the set of entities which are completely within that subspace.

This raises an interesting question: if an entity is intersecting a portal surface, which subspace is the entity now in?

My answer to this question is - both (or all) of them.
Either I record a reference to the entity in each subspace which it partly or wholly intersects, OR, each entity contains a reference-list of which Portals it is intersecting.

It actually makes more sense to me that an entity should track its portal intersections, since we can set a flag within that entity to tell it to check for portal intersection until intersection ceases - at which point, it can remove the redundant reference.

The alternative requires processing of each subspace (cell) in the world, which is exactly the kind of exhaustive approach I am trying to avoid (generally).
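
In rough C++-style pseudocode (my actual implementation lives in ObjAsm32, so treat this strictly as a sketch of the idea, with made-up names):

#include <algorithm>
#include <vector>

struct Portal;
struct Entity;
bool EntityIntersectsPortal(const Entity& e, const Portal& p);   // the actual overlap test

struct Entity {
    std::vector<Portal*> touchingPortals;   // portals this entity currently straddles
    bool checkPortals = false;              // set when the entity first touches a portal

    void UpdatePortalIntersections()
    {
        if (!checkPortals)
            return;                         // nothing to re-test for this entity
        // drop references to portals we no longer intersect
        touchingPortals.erase(
            std::remove_if(touchingPortals.begin(), touchingPortals.end(),
                           [this](Portal* p) { return !EntityIntersectsPortal(*this, *p); }),
            touchingPortals.end());
        if (touchingPortals.empty())
            checkPortals = false;           // intersection has ceased - stop checking
    }
};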

Anyway, as soon as that's done, I'll be ready to reintegrate (lol) my Physics Engine component, with the ambition being to extend the partitioning scheme to the physical simulation.

For the moment, I've added code to my per-frame RenderText method that determines and displays which Leaf (aka subspace, aka cell) contains the point at the origin of the camera... it's crude and involves a full tree walk; it's just more debug and feel-good code to prove to myself that I actually understand everything I'm talking about, and understand it intimately.
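
The lookup itself is simple enough - roughly this, assuming every internal node has both children and geometry lives only in the leaves (sketch only, my real structures are OA32):

struct Vec3  { float x, y, z; };
struct Plane { float nx, ny, nz, d; };   // nx*x + ny*y + nz*z + d = 0

struct BspNode {
    Plane    split;
    BspNode* front = nullptr;            // internal nodes have both children,
    BspNode* back  = nullptr;            // leaves have neither
    int      leafId = -1;                // only meaningful for leaves
};

// Walks from the root down to the leaf (convex subspace) containing point p.
int FindLeaf(const BspNode* node, const Vec3& p)
{
    while (node->front && node->back) {
        float dist = node->split.nx * p.x + node->split.ny * p.y
                   + node->split.nz * p.z + node->split.d;
        node = (dist >= 0.0f) ? node->front : node->back;
    }
    return node->leafId;
}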

That brings me almost to the end of Alan's walk in this park - as soon as I've extended both Nathan's and Alan's previous work, with full credit given, I intend to crow from the nearest rooftop :)
Posted on 2009-06-28 01:02:41 by Homer

That's a good read, although I've read it before.
If you walk back over my comments in the past few posts, you'll see that I have done exactly what they suggest - few, large batches rather than many, small ones.
I don't have access to geometry shaders under D3D9/XP (although they're available to opengl, which is annoying).. and I don't see myself jumping onto any other Windows OS for the time being.


Well, none of what I said required D3D10 or geometry shaders.
I'm just saying that you should try to keep your vertexbuffers and indexbuffers static wherever possible. Don't think in terms of single polygons, but think in terms of 'primitives' as the DrawPrimitive API does.
Even back in the GeForce2 days, GPUs were massively faster at culling invisible polys than a CPU was. So if I had, say, an object of 5000 polys... I would simply test whether the object as a whole is visible or not (visible meaning: not completely outside the viewing volume... I don't even need to make the distinction between intersecting and fully inside). If it is, I just fire it off with a single call. The buffers are already stored in videomemory, and already optimized by the driver, so I get maximum performance. Even if only a handful of polys are actually visible, the GPU is faster at culling the remaining polys than my CPU could ever be, especially if I don't get in its way by modifying the buffers all the time (if you rewrite indexbuffers, you leave 'gaps' in your vertexbuffer access, which makes vertex caching far less efficient - it assumes roughly linear traversal of the buffer and can only cache a handful of vertices).
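
To make that concrete, the per-object path is roughly this in D3D9 (placeholder frustum/sphere types, material and vertex declaration setup omitted - a sketch, not code from any particular engine):

#include <d3d9.h>

struct Plane  { float nx, ny, nz, d; };     // frustum plane, normal pointing inward
struct Sphere { float x, y, z, radius; };   // object's bounding sphere in world space

bool SphereOutsideFrustum(const Sphere& s, const Plane frustum[6])
{
    for (int i = 0; i < 6; ++i) {
        float dist = frustum[i].nx * s.x + frustum[i].ny * s.y
                   + frustum[i].nz * s.z + frustum[i].d;
        if (dist < -s.radius)
            return true;                    // completely behind one plane: not visible
    }
    return false;                           // intersecting or fully inside: just draw it
}

void DrawObjectIfVisible(IDirect3DDevice9* dev,
                         IDirect3DVertexBuffer9* vb, IDirect3DIndexBuffer9* ib,
                         UINT stride, UINT numVerts, UINT numTris,
                         const Sphere& bounds, const Plane frustum[6])
{
    if (SphereOutsideFrustum(bounds, frustum))
        return;                             // one cheap test covers the whole object

    // the buffers were created once and are never locked again after that
    dev->SetStreamSource(0, vb, 0, stride);
    dev->SetIndices(ib);
    dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, numVerts, 0, numTris);
}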


So I'm stuck with throwing triangles at the videocard manually, which is why I'm using a partitioning scheme to help reduce the number of polygons I am throwing around (alleviate bus congestion bottleneck).


Well, that's what I'm saying... If you just store the geometry statically on the videocard, it may be faster NOT to reduce the number of polygons, because you have eliminated the bus transfer from your algorithm altogether.
Again, a modern GPU can handle thousands of polygons in the time your CPU can handle one. That is the thing here... At first glance it may seem smart to reduce the number of polygons drawn... But if you dig deeper: if it takes more time to remove a polygon from the drawing set than it takes the GPU to just render it, you've only fooled yourself.
The same goes for the zbuffer, for example. In the old days we used to just z-sort polygons (painter's algorithm), because you didn't check the depth at every pixel, but just at every polygon. So you saved a lot of checks and a lot of bandwidth.
However, as geometry got more complex, the difference in number of pixels and number of polygons diminished. And as shading got more complex, the ability of the zbuffer to skip the shading for a pixel actually started saving time.
So even with software solutions, zbuffer was sometimes faster than zsorting. And on hardware it never even was an issue.



Besides, I want my stuff to run on 'anything out there', I expect my software to scale to the hardware, not the other way around.. which I believe is something you're looking to try to squeeze out of your DX10 engine?


My DX10 engine evolved from my earlier engines. Although I've long removed support for DX8 and lower, I kept DX9 support because of XP.
So my engine can still run on ancient hardware like a GF2, if I so choose. The old fixedfunction routines are still in there somewhere, including some CPU-based T&L, because although GF2 could do per-pixel lighting with dot3, it couldn't perform the setup for interpolating normals and light vectors in hardware.
So yes, my software can scale to the hardware very well. It's just that I draw the line somewhere. The problems with culling geometry even existed back on the GF2. Especially the more high-end ones, like the GTS/Pro/Ultra, had REALLY fast T&L engines, so polygons were really cheap to eliminate. I don't think it's useful to support hardware all the way back to GF2 in this day and age. I think if you just support DX9 SM3.0 cards, virtually everyone will be able to run it. And you can treat those very similarly to DX10 hardware. They are both massively faster with static geometry than they are with dynamic geometry. And in both cases their maximum polygon throughput is orders of magnitude larger than the throughput of a high-end CPU.

A simple example is this old piece of code: http://bohemiq.scali.eu.org/forum/viewtopic.php?t=35
It contains an optimized CPU path to generate the shadowvolumes, and a bruteforce vertexshader-based approach.
While the CPU algorithm is far more elegant, my 3 GHz Core2 Duo can only coax 1500 fps out of it... Running the vertexshader on the 9800GTX+ card that I have here gives me 3200 fps - more than twice as fast, despite the shaders having to do three times the work.
That's just how it is. My CPU is holding up my GPU because it can't process the polygons quickly enough. And this is just a test scene with a few thousand animated polygons (I believe it has about 2500 polygons).
It's a much better tradeoff to let the GPU do the bruteforce approach. Not only do I get a higher framerate, it also frees up the CPU for other tasks.
And as you can see from the few statistics posted in the thread itself, this is not a new phenomenon. It was the same back in the days of Pentium 4's and GeForce 6800s. I think if you support back to the level of GeForce 6-series today, you're being generous enough. That was 5 years ago.
That's actually the biggest of my gripes in this discussion... What I'm saying is not something new, it's a problem that has existed for years.
I'm not sure what you're targeting exactly, when you say "anything out there"?
Posted on 2009-06-29 04:34:42 by Scali

Finished implementing the new rendering code for 'World' geometry.
It uses the portal-based visibility culling scheme I outlined previously.
Since my World model is currently so simple, I can't see an appreciable change in FPS over bruteforce rendering via DrawSubset.


Posted on 2009-06-30 10:52:44 by Homer

Since my World model is currently so simple, I can't see an appreciable change in FPS over bruteforce rendering via DrawSubset.


Ah, I see... to you a mesh actually means a D3DX mesh? That would explain some of the questions I had.
But that was one of the points in the Batch, Batch, Batch presentation: if you are CPU-limited, you can add extra triangles for free.
Posted on 2009-06-30 11:41:16 by Scali
Yes, I make this distinction because any buffer full of vertices could be considered a Mesh, even if there is no topological information (such as edge connectivity)... I do use the term 'mesh' to describe a D3D mesh, since it's not merely a point cloud or even a triangle soup - which is what I import from a D3D Mesh. I don't import the topological data, so I don't consider my imported data to be a mesh. I'm not exactly sure what the accepted definition of a mesh is, but I would strongly suspect it mentions topology, or at least implies connectivity between its elements.

Tonight I will begin to introduce chunks of my physics engine, starting with just the 'time core' and collision detection components. I'll leave collision response until later, simply because a game engine requires collision tests, but physical simulation of collision RESPONSES is not always a requirement.
Posted on 2009-07-01 01:35:54 by Homer

Yes, I make this distinction because any buffer full of vertices could be considered a Mesh, even if there is no topological information (such as edge connectivity)... I do use the term 'mesh' to describe a D3D mesh, since it's not merely a point cloud or even a triangle soup - which is what I import from a D3D Mesh. I don't import the topological data, so I don't consider my imported data to be a mesh. I'm not exactly sure what the accepted definition of a mesh is, but I would strongly suspect it mentions topology, or at least implies connectivity between its elements.


Well, the concept of a mesh is far older than the D3D mesh class itself, and the D3D mesh class is just one specific interpretation of what a mesh is. So I don't automatically link the word mesh to the D3D object.
The definition for 'polygon mesh' on Wiki is huge:
http://en.wikipedia.org/wiki/Polygon_mesh

But they don't explicitly state that you need topological info. Besides, in D3D topology is implicit anyway, because your vertices are always ordered by triangle. You just specify whether you have a triangle list or strip or such, and your topology is implicitly defined. Not in the most useful format for some occasions perhaps, but you can always rebuild the topology from the list of vertices.
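
For example, edge connectivity can be recovered with a single pass over the index list. A sketch (it assumes the vertices are welded - duplicated positions would have to be merged first):

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Edge = std::pair<uint32_t, uint32_t>;   // vertex index pair, smaller index first

// Maps every edge to the triangles that use it; shared edges end up with 2+ entries.
std::map<Edge, std::vector<uint32_t>>
BuildEdgeAdjacency(const std::vector<uint32_t>& indices)   // triangle list indices
{
    std::map<Edge, std::vector<uint32_t>> edges;
    for (uint32_t tri = 0; tri * 3 + 2 < indices.size(); ++tri) {
        for (int e = 0; e < 3; ++e) {
            uint32_t a = indices[tri * 3 + e];
            uint32_t b = indices[tri * 3 + (e + 1) % 3];
            if (a > b) std::swap(a, b);       // direction-independent edge key
            edges[{a, b}].push_back(tri);
        }
    }
    return edges;
}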

To me, a mesh is just a list of polygons that are in some way related to each other. They are connected, so to speak - either physically, or because they share common properties like materials, or belong to a single object, etc.
That is, in the strictest sense I consider a mesh to be physically connected polygons... However, in practice you can just as easily 'abuse' the same datastructures for polygons that aren't necessarily connected. In D3D (or most triangle rasterizers, for that matter), whether polygons are connected or not doesn't really have any significance.
In my implementation of a mesh, it's just a set of vertexbuffers, an indexbuffer and their primitive descriptions, combined with a single material description. So a 'mesh' is just something that can be drawn with a single material setup and then one or more DrawPrimitive() calls.
An object then is a list of meshes which may or may not share the same material. Objects also handle local coordinate spaces and animation and such.
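
In simplified C++ terms (a paraphrase of the description above, not my actual classes):

#include <d3d9.h>
#include <vector>

struct Material;                    // textures, shader constants, render states...

struct Mesh {                       // drawable with ONE material setup
    std::vector<IDirect3DVertexBuffer9*> vertexBuffers;
    IDirect3DIndexBuffer9*               indexBuffer    = nullptr;
    D3DPRIMITIVETYPE                     primitiveType  = D3DPT_TRIANGLELIST;
    UINT                                 primitiveCount = 0;
    Material*                            material       = nullptr;
};

struct Object {                     // handles local space, animation, etc.
    std::vector<Mesh> meshes;       // may or may not share materials
    D3DMATRIX         localToWorld;
};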

So far that approach has been valid throughout various iterations of my engines, since the late 90s I think... but perhaps at some point that will change :)
Posted on 2009-07-01 02:23:31 by Scali
I'm making some hefty changes to the physics engine as I drag it headlong into the new GameEngine framework.
For starters, I've stripped it back for a staged implementation.
I've also decided to make some critical changes to the architecture of the physics engine.
Notably, I've decided to change my support for mesh collisions to use a reference mesh representing the (assumedly convex) hull, rather than the piecemeal pseudo-primitive approach I'd been using.
This will clean up the code immensely, and give me a groundwork for animated hulls.
I've already experimented in that arena with a demo that calculated boundingboxes for the influenced vertices of each bone in a SkinMesh in 'bone space', then animated these box-shaped collision hulls by simply driving them with the animated bone matrices. That looked promising, but adapting it to an arbitrary hull sounds even better (a tighter fit, more control over the number and position of hull vertices, etc).

Posted on 2009-07-01 10:27:02 by Homer
I'd still like to know though... you say you want to target "anything out there". So what would be your minimum spec system, in terms of OS support, memory, CPU, videocard etc?

Edit: By the way, I can't really seem to get a grip on you and what you're doing. We seem to be talking on different wavelengths at times, I guess.
Perhaps you think I'm asking too many 'difficult' questions, and I get the feeling you're being a bit evasive at times.
I'm just genuinely interested in what it is you're doing. I've written my share of graphics routines over the years, and it's interesting to revisit some topics. I may discover things I haven't thought of yet, or things that can perhaps be done in better ways now than back when I did them. And conversely, I may be able to give you some new inspiration as well.
Posted on 2009-07-01 12:22:14 by Scali
By 'anything out there', I mean any Windows NT system and upwards.
We've made some early steps toward porting ObjAsm32 to Linux (under JWASM), but for now we're pretty much tied to 32-bit Windows systems. I'll begin to support 64-bit as soon as I can justify buying a new machine, given that I openly refuse to upgrade from XP SP3 RC4 to any other Windows OS... time to jump ship, not go down with it.

Now that I've reached a reasonable milestone with the game engine, I'd be willing to post both binary and source for the portal rendering code, since I don't own any intellectual property in there.
I don't intend to be evasive, far from it... I'm willing to explain my ideas and concepts in great detail, but not many people want to know every little detail, they're just curious onlookers who like the occasional screenshot.
The comments in my sourcecode probably say a whole lot more than I do.

A little about me:
I tend to have a hacker mentality when it comes to programming in general - I sincerely believe that not only CAN the majority be wrong, but they USUALLY are... so I never accept anything on faith, no matter who says it, or how many times, or how loudly. This makes it possible for me to innovate, to discover new things, and to break new ground, simply because I don't tell myself "it can't be done" or "it's slow, and there's no way to speed that up"... whereas those who naively assume that the assertions of others are correct (no matter how godlike they may be) are dooming themselves to relative obscurity, or at least banality. Because they have tunnel vision, they will never do anything that is not already being done; they'll tend to just reinvent the wheel, and will never stand out from the crowd.

This began for me when I learned machine code on the C64 and discovered that it was possible to use software to make the hardware do things well outside its design specs without breaking anything. I've been pulling things apart all my life to see how they work, and whether it's code, electronics or machinery, I will invariably see IMMEDIATELY some things that can be improved, or are completely redundant. I will never "leave it alone, it's not broken".
Also, I like to do things for myself - I could have just plugged in an existing physics engine, hey I could have just used an existing GAME engine, but what satisfaction (or potential for anything else) is there in that?
I need to be constantly challenged, and as a result, I tend to throw myself into waters that are just a little (sometimes a lot) over my head. I believe that the less you know about a given field, the faster you will learn about it, and the more likely you are to discover something new (see Paradox of the Expert).

Posted on 2009-07-02 01:34:32 by Homer

By 'anything out there', I mean any Windows NT system and upwards.


Yea, but from what I understood, you are using Direct3D 9, so that would already limit you to Windows 2000 and higher, and also a limited subset of CPUs and GPUs. So it's never REALLY 'anything out there'.
And although D3D9 goes back quite far in terms of hardware support, it's very difficult to make things work everywhere, especially when you get into fixedfunction shading. Some hardware only supports 2 textures per pass, others don't have this-and-that texture op, etc.
So where exactly do you draw the line?
Back in the day I would just make sure my code ran on GeForce2 and Radeon 7000-series. Those were the most common fixedfunction cards. Anything else I considered either too slow (Intel IGP) or too rare to even spend time on.
With shaders it became a bit easier, because a certain shadermodel defines a certain set of features.
So currently I think I draw the line at SM3.0. Good enough for nearly anything, and common enough that it will run nearly everywhere.


I don't intend to be evasive, far from it... I'm willing to explain my ideas and concepts in great detail, but not many people want to know every little detail, they're just curious onlookers who like the occasional screenshot.


I like the details. Sometimes minor details can be very valuable in optimizing an algorithm...
For example, 9 out of 10 GIF unpackers use a relatively naive decoding scheme. Because of the way the LZW compression works, you always decode data 'backwards'. So if I pack "backwards", the unpacker will get "sdrawkcab" from its table, which it will then have to reverse byte-by-byte. Usually this is done via a stack of some sort.
1 out of 10 GIF unpacker coders will have the realization that if you know how long the string is, you can write it out in place, back to front, and greatly speed up the unpacking. They will then see that you DO know how long the string is... since it's built up recursively (which is why it's backwards: you get 1 symbol and a reference to a previous string). So you start with strings of length 1, and every time you encounter a new string, it is based on an existing string, but 1 symbol longer.
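
In sketch form (a stripped-down LZW table, nowhere near a complete GIF decoder):

#include <cstdint>
#include <vector>

struct LzwEntry {
    int      prefix;   // index of the previous (shorter) string, -1 for root symbols
    uint8_t  symbol;   // the symbol appended to that string
    uint32_t length;   // total string length, known the moment the entry is created
};

// Appends the decoded string for 'code' to 'out', already in the right order:
// walk the chain backwards, but write each symbol straight into its final spot.
void EmitString(const std::vector<LzwEntry>& table, int code, std::vector<uint8_t>& out)
{
    size_t end = out.size() + table[code].length;
    out.resize(end);
    while (code != -1) {
        out[--end] = table[code].symbol;
        code = table[code].prefix;
    }
}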

Similarly with Huffman decoding, you often see people traversing a symbol tree bit by bit... while by definition, if a Huffman code has N bits, no other code can start with those same N bits (no code is a prefix of another). In other words, that N-bit pattern is unique.
So, a much faster way is to precalc a table where you take the unique N-bit pattern for each symbol, and fill it out to a certain number of bits, say 8 bits, by bruteforcing all possibilities. In the table you store the symbol belonging to the N-bit pattern, and the actual length of the code (N bits).
Then you can decode by just grabbing 8 bits from the stream, doing a lookup, then advancing the stream by N bits. An incredible deal faster than walking down a tree for every symbol.
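
Something like this, assuming no code is longer than 8 bits just to keep the sketch small (Peek8/Skip stand in for whatever bit reader you use):

#include <cstdint>

struct HuffEntry {
    uint8_t symbol;       // decoded symbol for this 8-bit pattern
    uint8_t codeLength;   // actual length N of the code (N <= 8)
};

struct BitStream {
    uint32_t Peek8() const;       // next 8 bits without consuming them
    void     Skip(uint32_t n);    // advance by n bits
};

// The 256-entry table is built once, up front: every 8-bit value that starts
// with a symbol's N-bit code maps to the same (symbol, N) pair.
uint8_t DecodeSymbol(const HuffEntry table[256], BitStream& bits)
{
    const HuffEntry& e = table[bits.Peek8()];   // one lookup instead of a tree walk
    bits.Skip(e.codeLength);                    // only consume the code's real length
    return e.symbol;
}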

Some details are just little hints at how you can get your code much faster.


I tend to have a hacker mentality when it comes to programming in general - I sincerely believe that not only CAN the majority be wrong, but they USUALLY are... so I never accept anything on faith, no matter who says it, or how many times, or how loudly.


I can fully relate to that (I guess the above example also illustrates it). I suppose the majority of assembly programmers do, else they wouldn't be programming assembly in the first place.


This makes it possible for me to innovate, to discover new things, and to break new ground, simply because I don't tell myself "it can't be done" or "it's slow, and there's no way to speed that up"... whereas those who naively assume that the assertions of others are correct (no matter how godlike they may be) are dooming themselves to relative obscurity, or at least banality. Because they have tunnel vision, they will never do anything that is not already being done; they'll tend to just reinvent the wheel, and will never stand out from the crowd.


Yea, I can relate to that as well. I wrote a software 3D engine in Java, complete with crazy things like multitexturing, texture filtering, bumpmapping, shadowmapping, skinning and everything.


Also, I like to do things for myself - I could have just plugged in an existing physics engine, hey I could have just used an existing GAME engine, but what satisfaction (or potential for anything else) is there in that?


Yea, in many cases I just prefer to 'roll my own'. The main reasons are that, firstly, I know EXACTLY what I want, so I can tweak the code to meet my performance requirements, rather than settle for some generic solution which may be far from optimal in my case. And secondly, I may be doing things that very few others before me have done, so there is no clear-cut solution available yet.

But over the years I've become a bit more 'mellow' in that regard. Partly because now that I work fulltime, things have changed. At work you just need to get the job done quickly, at times... and often the things are quite trivial anyway, so there isn't really much room for creativity or anything. Also, it leaves me less time to do hobby projects, and I may not always be in the mood for coding when I'm at home.

Another thing is that work (and experience) changes your outlook on some things. You get a better feel for what you should or shouldn't rewrite/optimize. Sometimes "good enough" is just good enough. For example, I rarely used MFC myself. People always complain about how bloated it is and all that... But at work, we pretty much use MFC on everything. Some of the stuff is actually quite good. If you want to make a nice window or dialog, MFC just makes it a lot easier for you. And performance isn't really an issue anyway... Aside from the fact that this part of MFC is actually just a very thin wrapper around the Win32 API anyway. It's mainly things like the CString, CArray etc that I'm still not that happy with.
I actually converted my main app and window for the D3D engine to MFC a while ago. Makes the code a lot cleaner and simpler, and in the end it's just a window to attach your D3DDevice to, so it doesn't affect performance at all.
So in about 10 years I've gone from using assembly everywhere (like the DirectDraw plasma) to using 'notorious bloatware' like MFC... Except I never made any compromises to the actual performance of my applications. I just learnt what to use where.

It also helps that I developed a nice library of standard code. For example, I implemented a state caching system in my D3D10/11 engine a few weeks ago... It still uses a hashtable class that I wrote about 10 years ago. It was aimed at maximum lookup speed, and it still is a great tool for that. It still surprises me how often I see code at work where I think "This could have been SO much faster and/or simpler if they just used a hashtable". People just like to loop through lists or have tons of if/else clauses stacked together. Perhaps they don't know any better.
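
The pattern itself is trivial - here with std::unordered_map standing in for my own hashtable class, and made-up state types:

#include <cstdint>
#include <unordered_map>

struct StateDesc   { /* fill mode, cull mode, blend settings, ... */ };
struct StateObject;                                  // whatever the API hands back

uint64_t     HashDesc(const StateDesc& desc);        // any decent hash over the description
StateObject* CreateState(const StateDesc& desc);     // the expensive call we want to cache

std::unordered_map<uint64_t, StateObject*> g_stateCache;

StateObject* GetState(const StateDesc& desc)
{
    uint64_t key = HashDesc(desc);
    auto it = g_stateCache.find(key);
    if (it != g_stateCache.end())
        return it->second;                           // cache hit (a real version would also
                                                     // compare the full desc against collisions)
    StateObject* state = CreateState(desc);
    g_stateCache[key] = state;
    return state;
}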
Posted on 2009-07-02 03:27:16 by Scali
Yeah, we began supporting a state table in OA32 about two years ago.
Really helps in an object oriented environment where everything has its own set of states.
Posted on 2009-07-02 09:10:04 by Homer
But you still avoided the question of what hardware you're actually targeting :)
Posted on 2009-07-03 03:30:23 by Scali
Anything that complies with pixel shader 3.0, under DX9.0c or above, since I won't be doing (much/any) vertex shading on the GPU.
Posted on 2009-07-03 04:42:40 by Homer

Anything that complies with pixel shader 3.0, under DX9.0c or above, since I won't be doing (much/any) vertex shading on the GPU.


Okay, so basically we're both targeting SM3.0 and higher...
Why aren't you doing vertex shading on the GPU though?
Posted on 2009-07-03 04:48:24 by Scali
The only place I'd ever need it is skinmesh, and of the four popular styles of skinmesh animation, I found virtually no benefit in using the GPU over the CPU - in fact, the CPU palette version produces the highest fps on my 8600 and 8800 cards. That being the case, I'm not inclined to waste GPU time on matrix palettes; I can think of better ways to flex the GPU muscle than this. I'm a shader fan, but I accept that some things are just not shader-oriented, even though they are 'doable'.


Posted on 2009-07-03 08:30:37 by Homer

The only place I'd ever need it is skinmesh, and of the four popular styles of skinmesh animation, I found virtually no benefit in using the GPU over the CPU - in fact, the CPU palette version produces the highest fps on my 8600 and 8800 cards. That being the case, I'm not inclined to waste GPU time on matrix palettes; I can think of better ways to flex the GPU muscle than this. I'm a shader fan, but I accept that some things are just not shader-oriented, even though they are 'doable'.


What about that old skinned shadowvolume I linked to earlier?
Not only does it skin the shadowvolumes AND the object itself, it also does it in a relatively inefficient way, because for each vertex you need to skin the entire triangle it belongs to. So every vertex ends up being skinned 3 times, where the CPU only has to do it once.
And STILL the GPU is about twice as fast. So technically that makes the GPU 6 times as fast as a CPU at skinning operations.

Vertexshaders are THE place for flexing GPU muscle. Pixelshaders are nice and all, but if you're going to use your CPU for T&L, you'll be bottlenecking the triangle throughput anyway, so you CAN'T flex the GPU muscle on pixelshaders (unless you want to use REALLY lowpoly scenes at REALLY high resolutions and AA, and LOTS of overdraw... but why do that when you can use vertexshaders to get highpoly for free?)

So what exactly are you doing that makes it so slow? You can just look at the shader code I use in my example. I believe I had a palette with 4 matrices there, but it could scale up to more matrices easily, up to about 20 I guess. The difference between CPU and GPU would only get larger.
How is your approach different from mine?
And considering the artificial limits you're putting in with your CPU, what exactly DO you plan to use the GPU muscle on?
Heck, even back in the GF2 days, I would limit my CPU-code to only update the parts of the vertexbuffer that hardware T&L couldn't handle... Eg calcing per-vertex light vectors for dot3 lighting. I'd still let the GPU handle all other T&L, because it did it way faster than the CPU could. Matrix/vector calculations, dotproducts, crossproducts, sqrts, pow... GPUs demolish CPUs in those categories.
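
For what it's worth, that dot3 setup boils down to something like this per vertex (a sketch with my own little helper types, not my actual GF2-era code):

#include <cmath>
#include <cstdint>

struct Vec3 { float x, y, z; };

static Vec3  Sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  Normalize(Vec3 v)   { float l = std::sqrt(Dot(v, v)); return { v.x / l, v.y / l, v.z / l }; }

// Computes the per-vertex light vector, rotates it into tangent space and packs
// it into a 0xAARRGGBB colour, ready for a D3DTOP_DOTPRODUCT3 texture stage.
uint32_t PackLightVector(Vec3 lightPos, Vec3 vertexPos,
                         Vec3 tangent, Vec3 bitangent, Vec3 normal)
{
    Vec3 l = Normalize(Sub(lightPos, vertexPos));
    Vec3 t = { Dot(l, tangent), Dot(l, bitangent), Dot(l, normal) };              // world -> tangent space
    auto pack = [](float c) { return (uint32_t)((c * 0.5f + 0.5f) * 255.0f); };   // [-1,1] -> [0,255]
    return 0xFF000000u | (pack(t.x) << 16) | (pack(t.y) << 8) | pack(t.z);
}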

Wanting to re-invent the wheel is one thing.... but you have to be careful... Sometimes you arrive at a square wheel... then you establish that the flat sides are the problem, so you decide to eliminate one of the flat sides and arrive at a triangular wheel... If you know what I mean :)
Posted on 2009-07-03 08:43:30 by Scali