If you had two mesh object instances, you would need to multiply matrices.
Period.
Posted on 2012-09-16 04:27:50 by Homer

If you had two mesh object instances, you would need to multiply matrices.
Period.


Because you say so?
This makes no sense whatsoever.
What do you even mean by 'two mesh object instances' in this context?
Besides, I already mentioned (mat*mat)*vec vs mat*(mat*vec) earlier.
It sounds like you don't grasp basic matrix math.
As always: produce example code if you want to show something.
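
To illustrate, here's the difference as a minimal GLSL sketch (the names are mine, purely for illustration):

uniform mat4 World;
uniform mat4 ViewProj;
attribute vec4 Vertex;

void main()
{
    // (mat*mat)*vec: a full mat4*mat4 multiply, executed for EVERY vertex
    //gl_Position = (ViewProj * World) * Vertex;

    // mat*(mat*vec): just two mat4*vec4 multiplies per vertex
    gl_Position = ViewProj * (World * Vertex);
}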

Heck, this whole thread is ridiculous. "Case study"... Pah! Could you get any more pretentious?
"Projection can be done without matrices!". No s***, Sherlock!
Then you come up with some code that uses 4 constants.
Wow, what a surprise!
Seeing as the average projection matrix looks like this (http://msdn.microsoft.com/en-us/library/windows/desktop/bb205351(v=vs.85).aspx):
xScale   0        0               0
0        yScale   0               0
0        0        zf/(zn-zf)     -1
0        0        zn*zf/(zn-zf)   0

Hey look! Only 5 of these values are non-zero in the first place! And one of them is -1, which we can easily replace with a subtraction. So that's only 4 constants left! Wow, amazing that this can be done with a float4 constant!
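
Spelled out, that float4-constant projection would look something like this (a minimal sketch; the constant name and packing are my own, following the matrix above):

uniform vec4 ProjConst;   // = (xScale, yScale, zf/(zn-zf), zn*zf/(zn-zf))

vec4 project(vec4 eye)    // eye = camera-space position, with eye.w == 1
{
    return vec4(eye.x * ProjConst.x,
                eye.y * ProjConst.y,
                eye.z * ProjConst.z + ProjConst.w,
                -eye.z);  // the -1 entry is just a negation
}
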
What an amazing conclusion to a fascinating case study!
I bow down to your greatness, Homer!
I can't wait until you also explain how you can clip polygons without actually clipping them, but just doing a division by w!
Posted on 2012-09-16 04:38:04 by Scali
Wow, you understood it.

You do realize this is not the only projection matrix form; it is just the one I chose for the explanation in my blog, as it is the easiest to understand.
But they all produce the same output variables.

Our total saving?

Projection Matrix, 75%

View Matrix, 50%

World Matrix, 50%

we used 41.6% of the gpu bus width (80 bytes instead of the 192 bytes of three full matrices) and slightly fewer hw ops on the gpu.


We can do something similar for the world and view transforms, if we accept that we can eliminate scale and shear from the transforms, so there is only rotation and translation.
They can be passed as 2xVec4, and although the math for the transformations through three spaces is roughly equivalent, we benefit in two ways: 1) we get the partials (world*view, for example), and 2) we send less data across the gpu bus, which is quickly becoming the worst bottleneck.
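
For concreteness, one plausible 2xVec4 packing is a unit rotation quaternion plus a translation (a sketch with illustrative names, not my actual code):

uniform vec4 WorldRot;     // unit quaternion (x,y,z,w): the rotation part
uniform vec4 WorldTrans;   // translation in xyz, w unused

vec3 transformPoint(vec3 p)
{
    // rotate by the quaternion: p' = p + w*t + cross(q.xyz, t), with t = 2*cross(q.xyz, p)
    vec3 t = 2.0 * cross(WorldRot.xyz, p);
    return p + WorldRot.w * t + cross(WorldRot.xyz, t) + WorldTrans.xyz;
}

That's 32 bytes per transform instead of 64, which is where the 50% figures above come from.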

This especially suits bone transforms, since we can (usually) be sure that they don't contain any scale keys.

Posted on 2012-09-18 04:39:10 by Homer

Wow, you understood it.


Well obviously. My very first post already pointed out that matrix operations are trivially just another notation for linear operations. Then I tried to gently point out that anyone who understands basic matrix/vector algebra implicitly also knows how to do all that math without matrices and vectors.
Obvious math is obvious, it's right there: http://en.wikipedia.org/wiki/Matrix_multiplication#Matrix_product_.28two_matrices.29
Usually these basics of matrix mathematics are taught in high school, or at least in the first year of college. You never finished school or something, Homer? For people who did, everything you said in this thread was painfully obvious, and warranted no 'case study' of any kind.
Even for people who haven't finished school, this would be one of the first things they'd learn when starting out with 3d graphics programming. Back in the late 80s/early 90s, a lot of 3d demos were written by guys who were only 16-18 years of age, and they did all of this on systems without FPUs, which required a deeper understanding in order to maintain precision, stability and performance. I recently wrote some 3d routines on Amiga and 16-bit x86 myself to see if I 'still had it'. Apparently I do.


Our total saving?

Projection Matrix, 75%

View Matrix, 50%

World Matrix, 50%


Again, produce code so that these savings can be verified.


we used 41.6% of the gpu bus width (80 bytes instead of the 192 bytes of three full matrices) and slightly fewer hw ops on the gpu.


That's funny, since a few posts ago I pointed out that you'd actually need 3-4 EXTRA instructions on the GPU (by actually comparing matrix-based objectspace-to-clipspace code with your cameraspace-to-clipspace code).

Aside from that, your comparison is flawed, because you only look at the matrices individually, and ignore any other shader variables.
Namely:
- A single matrix takes 64 bytes
- You claim 16 bytes for projection and 32 bytes for view and model each, so a total of 80 bytes.
- If we were to send a single model*view*projection matrix, it would still be 64 bytes, which is actually 16 bytes *smaller* than your example. It would also require fewer instructions in the shader, as I already demonstrated earlier by comparing two shaders and looking at the instructions generated by the compiler.
Your approach gives up the flexibility of multiplying all matrices together. We've already seen that you cannot combine the projection with other operations, and the same will probably go for view and world, if you ever produce code for them.
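
For comparison, this is all the shader needs with a single concatenated matrix (a minimal sketch, names are mine):

uniform mat4 WorldViewProj;   // world*view*projection, concatenated once on the CPU
attribute vec4 Vertex;

void main()
{
    // one mat4*vec4 per vertex, and 64 bytes of constants in total
    gl_Position = WorldViewProj * Vertex;
}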


gpu bus, which is quickly becoming the worst bottleneck.


Not at all.
Firstly, I already pointed out that a few bytes more or less are not an issue whatsoever (by using actual figures from practical code tests). Modern systems have oodles of bandwidth. Saving a few bytes on a bus that can handle a few GB/s isn't going to do much.
Secondly, you wrongly assume that the GPU bus is the only overhead.
The REAL overhead is in the communication between your application and the driver, in handing over the data.
Updating shader variables has such high overhead that a few extra instructions on the CPU and a few more bytes transferred are hidden entirely (burst transfers and all that).
Having said that, in any reasonable scenario it should still be the general CPU and GPU code that forms the bottleneck, not so much the actual API calls/data transfer between CPU and GPU - as demonstrated by my examples of a skinned model on a PII 350 and an Android phone.

Aside from that, your logic is flawed since it breaks some basic rules of optimization:
Move as much workload out of the inner loop as possible.
Namely, even if we assume that we can win some time by sending less data to the GPU, this isn't meaningful if it makes the GPU execute more code for every vertex or even every pixel.
We would be talking about time saved once per object (perhaps 50 times per frame in realistic scenarios) vs time wasted on every vertex (thousands of times per frame, perhaps even millions in a modern high-end scenario such as Crysis).
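
To put that rule in code form (a sketch; the vertex count is an assumed figure for illustration):

uniform mat4 World, View, Proj;   // updated perhaps 50 times per frame on the CPU
attribute vec4 Vertex;

void main()
{
    // three mat4*vec4 multiplies, executed once per vertex: at a million
    // vertices per frame, that is two million extra mat4*vec4 multiplies
    // compared to a single premultiplied world*view*projection matrix
    gl_Position = Proj * (View * (World * Vertex));
}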

The problem is, it is impossible to argue with you.
Firstly, you seem dyslexic or whatever your problem is. It appears that your brain does not even process half the things I write in my posts. The opposite is also true: when you write posts, your brain seems to stumble over itself, confusing all kinds of terminology, resulting in some incomprehensible blurbs (cases in point: equating division-by-W to clipping, and equating normalization with homogeneous projection).
In this recent exchange it also looks like you were trying to use the geometry instancing technique as an example, but you didn't articulate that clearly at all. Nor did you acknowledge that it would still only apply to instances of a single object, whereas you generally have multiple objects in a scene, even with instancing. It's not a catch-all solution (not to mention that your claim about requiring matrix multiplication is complete nonsense, see below).
Secondly, you always make these wildly pretentious claims, but never back anything up with actual code that would allow others to understand and verify them, let alone offer suggestions to improve the code.
Lastly: you still have not commented on the clipping issue in the D3D pipeline thread. Am I to conclude that you're still denying that there is more to clipping than just rendering the triangles entirely or rejecting them, despite the overwhelming evidence?

If you weren't so convinced of your own greatness, perhaps you might actually turn out to be a nice guy, and perhaps we could actually have meaningful, constructive debates, from which we might both learn. You would, at any rate.
Case in point, my skinning shader: http://bhmfileformat.svn.sourceforge.net/viewvc/bhmfileformat/trunk/BHM3DSample/Data/VertexSkinShader.glsl?revision=55&view=markup
As you can see, it uses 4 different world matrices per vertex, and a separate projection matrix (or actually view*projection to be exact, in this case. You see, that's the beauty of matrices... No, you don't *need* a matrix for projection, but if you use a matrix, you can easily concatenate it with the view matrix and possibly others, so that you get projection for free. *That* is why the projection setup is normally in matrix form).
However, despite your claims there is no matrix*matrix operation in the shader. It does mat*(mat*vec) rather than (mat*mat)*vec to save valuable instructions.

position  = bones[int(Indices[0])] * Vertex * Weights[0];   // bone index attribute 'Indices': name assumed
position += bones[int(Indices[1])] * Vertex * Weights[1];
position += bones[int(Indices[2])] * Vertex * Weights[2];
position += bones[int(Indices[3])] * Vertex * Weights4;
...
gl_Position    = ProjectionMatrix * position;


Another little trick here is this:

float Weights4 = dot(Weights, vec4(-1,-1,-1, 1));

Instead of storing 4 weights per vertex, I only store 3. The total weight always has to sum up to 1, therefore the 4th weight can be inferred from the first three.
This saves 4 bytes per vertex. In my case the geometry consisted of about 6000 vertices if I'm not mistaken. So I save 23k on a single object. Considerably more substantial than what I'd get from any matrix reduction.
It also only costs me one instruction to calculate. Good tradeoff.
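
For those following along, this is why the dot product recovers the 4th weight (it relies on a vec4 attribute getting w = 1 filled in when the application only supplies 3 floats):

// Weights arrives as (w0, w1, w2, 1), so:
// dot(Weights, vec4(-1,-1,-1,1)) = -w0 - w1 - w2 + 1
//                                = 1 - (w0 + w1 + w2)
//                                = w3   (since all four weights sum to 1)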
Posted on 2012-09-18 05:53:59 by Scali
The real point is that, of the three major transforms, we need to update them separately and combine them in the shader, or else we have to combine them MANY times on the cpu for a single scene.

Example: the projection transform normally only changes if we changed the screen resolution (or just created a window). The view transform changes once per frame, assuming the player wiggled a controller. The world transform needs to change once for every rendered instance of a reference mesh - possibly several times per frame.

So sending the data in a smaller form, piecemeal, even if we have to do almost as much math on the gpu (almost, I claim, since I forgo scale), is usually going to be faster on the gpu, and we are reducing our bus bandwidth and achieving higher framerates.

If we only ever had to send one matrix, once, I would be cheering for you.
Posted on 2012-09-19 05:23:33 by Homer

The real point is that, of the three major transforms, we need to update them separately and combine them in the shader, or else we have to combine them MANY times on the cpu for a single scene.


Uhhh... Shaders are stateless. If you multiply inside a vertex shader, that multiply is done for EVERY vertex. Which is a LOT more often than whatever work you need to do on the CPU (which, as I said already, will mostly be hidden by the driver/bus overhead).


Example: the projection transform normally only changes if we changed the screen resolution (or just created a window). The view transform changes once per frame, assuming the player wiggled a controller.


So we have already established that view*projection only changes once per frame at most... Gee, one whole matrix multiply per frame. You really think our multi-GHz, multi-core, SIMD-capable CPUs are up to that?
Not to mention that CPU and GPU work mostly asynchronously. So in many cases the CPU can work ahead and queue up work in the driver. So your matrix multiplies can be performed while the GPU is busy, which would otherwise be time wasted by the CPU waiting on the GPU anyway.


The world transform needs to change once for every rendered instance of a reference mesh - possibly several times per frame.


Oh dear, 'several times per frame'... Compared to once per vertex (as stated, thousands to millions of times per frame). Yes, I really see where you're going with this... Wait, what?


is usually going to be faster on the gpu


Wrong, see above.
As I already said, you are moving more workload into the inner loop.


and we are reducing our bus bandwidth and achieving higher framerates.


Again, wrong.
The theoretical bus savings are immaterial given the other overhead.

Perhaps you are not aware of this, but there was a time when ALL 3d mathematics was performed on the CPU. So instead of just a handful of matrix multiplies to set up shader constants, the CPU also had to do the thousands of multiplies for per-vertex T&L and all that.
I think you have completely lost all perspective here (no pun intended).
Posted on 2012-09-19 07:46:55 by Scali

Whatever you like. I've made some major advances in my advanced material-based generated-shader scheme, and there are still no matrices anywhere.
I'm happy with where I am going and not really interested in anything that is not constructive.
Posted on 2012-09-21 03:09:47 by Homer

I'm happy with where I am going and not really interested in anything that is not constructive.


If you think optimization tips and sharing experiences with tuning performance-critical code on various platforms are not constructive, then fine. I won't waste my time on you anymore.
I just hope other people have enough common sense to steer clear of your suggestions.
Posted on 2012-09-21 03:29:48 by Scali