You cannot show more than one object with 'four instructions'. So it won't work for anything other than a very simple demo drawing a single model space.


We are discussing vertex shaders here.
Obviously the matrices (like many other variables) need to be updated for different objects between draw calls.
That doesn't mean that you need more than four instructions *inside* the vertex shader. This has nothing to do with the number of objects being drawn.
We are talking about per-vertex operations versus per-object operations here. Since per-vertex operations run at a much higher frequency than per-object ones, the per-vertex case is clearly where most of the savings are to be made.


Don't try to talk s***. You need a unique modelspace transform for every instance of a geometry; you know you can't just send one WorldViewProj for everything. And that's before we get to things like lighting and shadowing, which you apparently think are all best done in viewspace. Seriously man, you can't just send one matrix and call it a day. Jeez.
And that's using ONE shader! If we have more than one shader, there's no guarantee that uniforms will be stateful; we need to re-transmit them when shaders switch!


Wow, getting touchy, are we?
Anyway, all this hollow rhetoric aside: you still need to provide an example that shows that you are actually saving processing power.
Also, don't accuse me of things I never said. I never said you can use one matrix for everything, merely that you only need one matrix at a time in the simplest case, like that of your perspective example. Besides, one can argue that in many cases you still don't need to perform matrix*matrix operations inside the vertex shader even if you do need multiple matrices. As long as the source matrices are static for the entire draw call, then so are the results of the multiplications, so they can be precalculated and sent as extra shader constants. Or, failing that, you may be able to do mat*(mat*vec) rather than (mat*mat)*vec, which still saves you from doing a full matrix multiply.
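To illustrate, here is a minimal GLSL sketch (the uniform names uWorldView and uProj are just for illustration):

#version 120
// Two matrices, but no mat*mat inside the shader:
// (uProj * uWorldView) * aPosition would cost a full 4x4 * 4x4 multiply per vertex,
// while uProj * (uWorldView * aPosition) is just two mat*vec transforms.
uniform mat4 uWorldView;   // static for the entire draw call
uniform mat4 uProj;        // static for the entire draw call
attribute vec4 aPosition;

void main()
{
    gl_Position = uProj * (uWorldView * aPosition);
}
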
I also never said that things are best done in viewspace. I have absolutely no idea where viewspace came from here.

Also, with D3D10+ we do have the guarantee that such data is stateful. I believe there are some recent OpenGL extensions for uniform buffers that do the same.
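For reference, a uniform block as provided by that extension would look something like this in GLSL (just a sketch; the block and variable names are made up):

#version 140
// With GL_ARB_uniform_buffer_object (core since OpenGL 3.1), the block's
// storage lives in a buffer object, so its contents persist across program
// switches instead of being per-program uniform state.
layout(std140) uniform PerObject
{
    mat4 uWorldViewProj;
};

in vec4 aPosition;

void main()
{
    gl_Position = uWorldViewProj * aPosition;
}
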
Posted on 2012-08-08 08:27:26 by Scali
I apologise for my bad mood yesterday; you didn't deserve that. I'm human and under a lot of pressure (work related). I have no good excuse, so please disregard. I am not proud of my outburst.

GL extensions tend not to be well supported across the board: by their nature they are not, at least not yet, 'core' functionality, and support tends to differ from vendor to vendor.

All truth passes through three stages. First, it is ridiculed. Second, it is violently opposed. Third, it is accepted as being self-evident.

Matrix multiplication, whether done on the CPU or on the GPU, requires a crapload of mul operations; 16, from memory.
But that's not actually the problem I was alluding to.

The problem, in my mind, is not the speed of muls on the GPU, it's the GPU bus. We need to send data that is changing, and by reducing that to a fraction of its former self we stand to win; that would still be the case even if the GPU had to work harder. This is an acknowledgement of their complementary parallel nature, nothing more.

If we can reduce the bus traffic AND the number of operations (on either side of the fence), so much the better. This is not a formal proof, granted, but you should be able to smell the win on the bus issue alone.

With regard to the statefulness of OpenGL shader uniforms, this is a thorny issue: the specs state somewhere that they are stateful, and somewhere else mention that this state exists only until the shader changes. Some implementations do keep uniform state per shader, some do not, and there is no consensus. So we should assume they won't survive a shader switch, and send them per shader, per frame, all of them.
Posted on 2012-08-09 03:21:52 by Homer

GL extensions tend not to be well supported across the board: by their nature they are not, at least not yet, 'core' functionality, and support tends to differ from vendor to vendor.


If I'm not mistaken, it is this extension: http://www.opengl.org/wiki/Uniform_Buffer_Object
It has been core functionality since OpenGL 3.1, which afaik both nVidia and AMD have long had drivers for (and they supported it as an extension in even earlier drivers).

All truth passes through three stages. First, it is ridiculed. Second, it is violently opposed. Third, it is accepted as being self-evident.


Ah, more hollow rhetoric.


Matrix multiplication, whether done on the CPU or on the GPU, requires a crapload of mul operations; 16, from memory.


As I say, mat4x4*vec4 is 4 operations (either through dot4 instructions or multiply-add, depending on the orientation).
mat3x4*vec can be done with only 3.

So even if you need 2 matrices, that would only be 6 to 8 instructions if you do mat*(mat*vec) as I said before.
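In GLSL terms, that would look something like this sketch (mat4x3 is GLSL's name for a 4-column, 3-row matrix; the uniform names are illustrative):

#version 120
// The world-view transform is affine, so its bottom row (0,0,0,1) is implicit
// and transforming a position costs 3 dp4/mad-class operations instead of 4.
uniform mat4x3 uWorldView;
uniform mat4   uProj;
attribute vec4 aPosition;

void main()
{
    vec3 viewPos = uWorldView * aPosition;      // 3 operations
    gl_Position  = uProj * vec4(viewPos, 1.0);  // 4 operations
}
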

The problem, in my mind, is not the speed of muls on the GPU, it's the GPU bus. We need to send data that is changing, and by reducing that to a fraction of its former self we stand to win; that would still be the case even if the GPU had to work harder. This is an acknowledgement of their complementary parallel nature, nothing more.

If we can reduce the bus traffic AND the number of operations (on either side of the fence), so much the better. This is not a formal proof, granted, but you should be able to smell the win on the bus issue alone.


And you'd be painfully wrong.
4x4 matrices are only 64 bytes each. You'd need a LOT of matrices before you'd ever run into bandwidth problems, many more than you could even store in the constant registers on the GPU in the first place.
The PCI/AGP/PCI-e buses work with burst transfers. The real overhead is in setting up a transfer, which means that you get the first few hundred bytes 'for free'. Much like how rendering 1 triangle is no faster than rendering 1000 triangles on a T&L-capable GPU: at that point it's purely the overhead, not the actual transfer/processing speed.

So saving a few bytes on some matrices is not going to earn you much, if anything at all. You may save a handful of bytes that you may have gotten 'for free' anyway. In return, you need to put in extra GPU time on EVERY vertex that you process. That is something you WILL notice.

You want an example of bus speed? Well I can give you one.
With our current work, we are streaming HD-video directly to a texture. At 720p50 that is 1280x720 32-bit pixels, 50 times a second. So 1280*720*4*50 == 175 MB/s streaming through the bus.
Or at 1080p30 that is 1920*1080*4*30 == 237 MB/s.
I can easily do this in realtime, while the renderer maintains a framerate of well over 1000 fps, even on a mainstream videocard like a Radeon HD5770 or GeForce GTX460.

Now, 237 MB/s, that is 3.7 million(!) 64-byte matrices per second. And you're telling me that this is where savings should be done? Nope.

More sensible places for saving bandwidth would be the vertex data, or texture compression. Those require considerably more bandwidth because they are processed per-vertex or even per-pixel. So savings make more sense there. The data is also much larger than your matrix data, so you may even be able to save a considerable amount of videomemory, allowing you to use more geometry/texture data before having to spill out into system memory.
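For example, here is a sketch of thinner vertex data, assuming the normals are uploaded as 4 unsigned normalized bytes instead of 3 floats (cutting that attribute from 12 bytes to 4):

#version 120
uniform mat4 uWorldViewProj;
attribute vec4 aPosition;
attribute vec4 aPackedNormal;  // unsigned bytes, seen as [0,1] floats in the shader
varying vec3 vNormal;

void main()
{
    vNormal     = aPackedNormal.xyz * 2.0 - 1.0;  // unpack to [-1,1]
    gl_Position = uWorldViewProj * aPosition;
}
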
Posted on 2012-08-09 03:40:50 by Scali
My current game has 36 shaders, and each one of them takes three separate matrices (I could reduce that to two, but at the cost of some CPU muls, which become the new bottleneck). That amounts to a crapload of uploading, per shader, per frame. I can cut it down to about 25 percent of what it was, with some caveats, like eliminating scale from the world transform. Then on the GPU, the number of operations required for the same series of linear operations is similarly reduced, and the framerate in my early tests definitely shows it.

I would certainly go along with your comments about thin vertex data, but there's not a whole lot we can do there. This is a place where we can make 'some' impact, which reflects positively on our framerate and gives us a competitive advantage for, really, not much effort. It might not seem much to you.

The fewer shaders you have, the less you stand to gain. If you have one shader, this will all be nonsense to you.
Posted on 2012-08-09 04:56:43 by Homer

My current game has 36 shaders, and each one of them takes three separate matrices (I could reduce that to two, but at the cost of some CPU muls, which become the new bottleneck). That amounts to a crapload of uploading, per shader, per frame. I can cut it down to about 25 percent of what it was, with some caveats, like eliminating scale from the world transform. Then on the GPU, the number of operations required for the same series of linear operations is similarly reduced, and the framerate in my early tests definitely shows it.


Well, my skinning code used a matrix palette of 14 matrices to render the claw animation (alongside 2 other matrices and other lighting/material information). And that code can easily render at 8000+ fps on a modern system. In other words: at a framerate of 60 fps, you could render about 133 such objects with 16 matrices each (probably more, since my 8000 fps includes the overhead of starting/ending each renderpass, presenting and clearing the backbuffer, etc).
So I really don't see how 3 matrices would be troublesome.
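For reference, matrix-palette skinning in the vertex shader boils down to something like this GLSL sketch (the attribute and uniform names are made up, and the palette size is arbitrary):

#version 120
uniform mat4 uBones[16];       // the matrix palette, updated once per object
uniform mat4 uViewProj;
attribute vec4 aPosition;
attribute vec4 aBlendWeights;  // up to 4 bone influences per vertex
attribute vec4 aBlendIndices;  // bone indices, stored as floats

void main()
{
    // Blending the bone transforms is the per-vertex cost of skinning.
    vec4 skinned = (uBones[int(aBlendIndices.x)] * aPosition) * aBlendWeights.x
                 + (uBones[int(aBlendIndices.y)] * aPosition) * aBlendWeights.y
                 + (uBones[int(aBlendIndices.z)] * aPosition) * aBlendWeights.z
                 + (uBones[int(aBlendIndices.w)] * aPosition) * aBlendWeights.w;

    gl_Position = uViewProj * skinned;
}
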


The fewer shaders you have, the less you stand to gain. If you have one shader, this will all be nonsense to you.


I beg to differ. It is not about the number of shaders, but about how often you update the constants. Even if you use the same shader for all objects, you might still need to update matrices and material/light information in the shader for each object.
The only thing you'd save would be the changing of the shader itself (but that does not apply to this discussion of matrices vs other types of shader constants).
And you'd only need 2 shaders to lose that advantage. Whether you ping-pong between 2 shaders all the time, or switch between 36 shaders, or however many, doesn't really matter.
Posted on 2012-08-09 06:12:07 by Scali
Yep, 2 or 200: if there's more than one, then we need to deal with state.
Sometimes, on some drivers, we don't, per shader.
But we have to cater to the typical, usual case where uniforms are stateful ONLY for the 'lifetime' of a shader.
That sucks, but it's the way it is; I do my best to contend with this kind of state flapping.
OpenGL is not really good at statefulness: the more useless state changes you avoid in CPU land, the fewer redundant OpenGL calls you make, and that too adds up. And if you say it shouldn't matter, I will agree!
Posted on 2012-08-09 07:54:42 by Homer

But we have to cater to the typical, usual case where uniforms are stateful ONLY for the 'lifetime' of a shader.
That sucks, but it's the way it is; I do my best to contend with this kind of state flapping.
OpenGL is not really good at statefulness: the more useless state changes you avoid in CPU land, the fewer redundant OpenGL calls you make, and that too adds up. And if you say it shouldn't matter, I will agree!


Well, I got the 8000+ fps without using any kind of stateful shader constant trickery. Just vanilla DX9 and OpenGL 2.0 code.
The code had 3 different objects, using 2 different kinds of vertex shaders, each object with its own object space. So there were basically 3 shader changes + full state updates per frame, where one of the shaders had 16 matrices. And all that still at 8000+ fps, on a mainstream Core2 Duo 3 GHz and a GeForce GTX460; not a fancy modern CPU and high-end GPU.

So I say: not a problem with shader state changes.

Anyway, the OpenGL renderer is fully open source... so.
Posted on 2012-08-09 08:42:33 by Scali
Lately I'm not working on my high-end machines; I am coding for phones. We are JUST above the baseline and BARELY have a shader engine, and the CPU is WAY slower than the GPU. In this circumstance, all the things I have said ring true: perhaps not AS relevant for modern gear, but still a measurable improvement. I checked after you posted, and I did get a benefit on my GL 3.3 engine; I didn't try on 4+.
My goal at the moment involves porting high-end shaders to the old shadertongue and making them run at an acceptable rate, so I'm pretty stoked about all these developments.
Posted on 2012-08-09 09:00:01 by Homer
Well, I ported my OpenGL code to iPhone and Android as well, and it could easily reach 60 fps. Sadly there doesn't seem to be a way to disable vsync, so there's no easy way to find out how fast it can REALLY render.
Mind you, even my Athlon XP1800+ with Radeon 9600 could still reach about 700 fps on that code, and that's a box that is more than 10 years old, probably actually slower than a modern phone.

But well, feel free to post some actual code, so that we can compare and perhaps even optimize some stuff.

Edit: Oh in fact, it gets worse!
I just remembered that I ran it on the PII-350 with Radeon 8500 a while ago:
http://www.asmcommunity.net/board/index.php?topic=29617.210

That one still gets 375 fps. Now I *know* that thing is slower than a modern phone.
Posted on 2012-08-09 09:27:47 by Scali
As mentioned, I'm not actually developing on a GLES-compliant platform; I'm using that PowerVR wrapper to emulate GLES2 for alpha development. So I can indeed disable vsync and get some kind of idea of performance (not a true indication, but at least I can determine where the bottlenecks are, since the underlying infrastructure is either the same or close enough).
I can at least get a feeling for whether some implementation is a bad idea or a good idea by comparison.

I'm now looking at compressing the per-vertex tangent-space information into a single quaternion representing the three orthonormal basis vectors, if I can solve one issue relating to the handedness of the basis transforms.
Since this transform changes per vertex, there's no value in using matrices.

The issue at hand is that modelling apps don't seem to do a good job of ensuring that the handedness of the tangent-space basis is the same for each vertex of a given face. This gives us a headache when we try to interpolate the bases and then use the interpolated tangent-space vectors in the fragment shader... not a problem for simple texturing, but a problem for lighting of bumpmapped geometry (especially when skinned as well).
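For what it's worth, the decode side of what I have in mind would look something like this GLSL sketch; it assumes, as one possible convention, that the handedness is packed into the sign of the quaternion's w component:

#version 120
attribute vec4 aPosition;
attribute vec4 aTbnQuat;   // unit quaternion encoding the tangent-space basis
varying vec3 vTangent;
varying vec3 vBitangent;
varying vec3 vNormal;

// Rotate a vector by a unit quaternion.
vec3 quatRotate(vec4 q, vec3 v)
{
    return v + 2.0 * cross(q.xyz, cross(q.xyz, v) + q.w * v);
}

void main()
{
    vec4 q      = normalize(aTbnQuat);
    vTangent    = quatRotate(q, vec3(1.0, 0.0, 0.0));
    vNormal     = quatRotate(q, vec3(0.0, 0.0, 1.0));
    vBitangent  = cross(vNormal, vTangent) * sign(q.w);  // w sign = handedness
    gl_Position = aPosition;  // real transforms omitted for brevity
}
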
Posted on 2012-08-11 00:08:10 by Homer
Well, I suppose one way to measure it is to just skip the buffer swap (eglSwapBuffers() on these platforms). So you render everything, but just don't display it on screen.
I'll have to try that and see.

At any rate, the Pentium II I mentioned only has an AGP 2x bus, so it has considerably less bandwidth than a modern system. It has a theoretical limit of 533 MB/s.
A modern PCI-e 2.0 16x bus has a theoretical limit of more than 8 GB/s.
I'm not sure what the bus speed is between the CPU and GPU of a modern smartphone/tablet SoC, but I bet it's more than AGP 2x.

As for modeling... at the time I wrote my BHM exporter for 3dsmax, I don't think there was a way to get tangentspace information from the modeler at all. So I calculated tangentspace myself, in the exporter. As a result, I never had any issues with that in the first place.
I've also done an experiment with calculating the tangentspace in the geometry shader.
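Roughly like this GLSL 1.50 geometry shader sketch, which derives a flat per-face tangent from the triangle's position and UV deltas (the variable names are made up):

#version 150
layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;

in vec2 vTexCoord[];
out vec2 gTexCoord;
out vec3 gTangent;

void main()
{
    // Position and texture-coordinate deltas along two edges of the triangle.
    vec3 e1 = gl_in[1].gl_Position.xyz - gl_in[0].gl_Position.xyz;
    vec3 e2 = gl_in[2].gl_Position.xyz - gl_in[0].gl_Position.xyz;
    vec2 d1 = vTexCoord[1] - vTexCoord[0];
    vec2 d2 = vTexCoord[2] - vTexCoord[0];

    // Direction of increasing u in object space.
    float r = 1.0 / (d1.x * d2.y - d2.x * d1.y);
    vec3 tangent = normalize((e1 * d2.y - e2 * d1.y) * r);

    for (int i = 0; i < 3; ++i)
    {
        gl_Position = gl_in[i].gl_Position;
        gTexCoord   = vTexCoord[i];
        gTangent    = tangent;  // flat per-face tangent
        EmitVertex();
    }
    EndPrimitive();
}
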
Posted on 2012-08-11 06:02:31 by Scali
If you call glFinish() instead of swapping the buffers, I guess it's a fair appraisal?
That way the GPU needs to actually complete processing, artificially stalling the CPU until it's done?
Posted on 2012-08-11 06:51:46 by Homer
Hmm, it was not quite as simple as that on Android, since you don't call eglSwapBuffers() yourself.
Instead, you implement a 'Renderer' object, which has an onDrawFrame() callback; the rest is done by the Android framework itself, in which you register your Renderer.

Well, I did a quick hack where I just implement my own loop inside onDrawFrame(), so it never leaves the callback to get to the buffer swap (or whatever equivalent the framework may use).
And the result is that instead of 60 fps, I now get well over 100 fps.
The actual framerate seems incredibly jumpy, probably because this is not how the framework is supposed to be used, and it is probably doing stuff in the background... But it seems to average around 160 fps, with peaks of 233 fps.
So at the very least it's a rough indication of the actual performance.
Posted on 2012-08-11 06:56:57 by Scali
I guess that's the best we can hope for in this newfangled stuff. It's not made for our kind of programmer, and I'm not sure who they are aiming for anymore; I just deal with the fallout.
Posted on 2012-08-13 04:41:14 by Homer

I guess that's the best we can hope for in this newfangled stuff. It's not made for our kind of programmer, and I'm not sure who they are aiming for anymore; I just deal with the fallout.


I know what you mean; I stopped worrying about that sort of thing. I only have 2 choices anyway:
1) I use their platform
2) I don't use their platform

In general Android isn't that bad, once you've ironed out the initial kinks. And having vsync enabled by default is not bad either. Just makes it harder to benchmark during development :)
Posted on 2012-08-13 08:39:49 by Scali

Okay, work out the math and post it here, then we'll see how many instructions it takes. I'd be surprised if you can come up with a solution with less than 4 instructions.


No code was ever produced.
The validity of your claim that doing the perspective manually is more efficient is still up in the air.
Then again, in light of recent developments regarding perspective matrices, I'm not surprised that no code was ever produced...
Posted on 2012-09-15 17:52:28 by Scali
Projection Transform implemented as a Vec4 !!

Methods taken from my Pure Quaternion Camera class, based on knowledge gained from the journey taken within this thread !!
This code was already posted in my Blog, on this site.

Credit to Dzmitry Malyshau (kvarkus, author of KRI engine) for proving to me that this is not just possible, but cheaper. He did not show me how to generate the values.

Instructions for Vertex Shader authors included !!


// This function calculates the projection transform data, storing it as a Vec4, to be applied in the VS
//
void Perspective()
{
    // tan() of half the vertical FOV, as in gluPerspective
    // (assumes m_fFOVy holds the full field-of-view angle, in degrees)
    float ymax = m_fNear * tanf(Math::radians(m_fFOVy * 0.5f));
    float xmax = ymax * m_fAspectRatio;
    UpdateProjectionTransform(-xmax, xmax, -ymax, ymax, m_fNear, m_fFar);
}


inline void UpdateProjectionTransform(float left, float right, float bottom, float top, float znear, float zfar)
{
    float temp  = 2.0f * znear;
    float temp2 = right - left;
    float temp3 = top - bottom;
    float temp4 = zfar - znear;
    m_ProjectionValues[0] = temp / temp2;             // X scale
    m_ProjectionValues[1] = temp / temp3;             // Y scale
    m_ProjectionValues[2] = (-zfar - znear) / temp4;  // Z scale
    m_ProjectionValues[3] = (-temp * zfar) / temp4;   // Z translate

    // Note for shader programmers: the clipspace calculation is as follows:
    // Clip.W   = -View.Z
    // Clip.XYZ =  View.XYZ * Values.XYZ
    // Clip.Z  +=  Values.W
}


Posted on 2012-09-16 01:55:37 by Homer
That's not the shader code I asked for.
Make something that takes an object-space vertex as input and produces proper output to be forwarded to the rasterizer stage.
Then we can run it through the shader compiler and compare its instruction count against a regular vertex shader like this:

cbuffer cb0
{
    row_major float4x4 mWorldViewProj;  // World * View * Projection transformation
};

struct VS_INPUT
{
    float4 Position : POSITION;
};

struct VS_OUTPUT
{
    float4 Position : POSITION;  // vertex position
};

VS_OUTPUT main( in VS_INPUT Input )
{
    VS_OUTPUT Output;

    Output.Position = mul( Input.Position, mWorldViewProj );

    return Output;
}


This results in the following output:

//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
//
//  fxc ProjectionShader.vsh
//
//
// Parameters:
//
//  row_major float4x4 mWorldViewProj;
//
//
// Registers:
//
//  Name          Reg  Size
//  -------------- ----- ----
//  mWorldViewProj c0      4
//

    vs_2_0
    dcl_position v0
    mul r0, v0.y, c1
    mad r0, v0.x, c0, r0
    mad r0, v0.z, c2, r0
    mad oPos, v0.w, c3, r0

// approximately 4 instruction slots used


As I already said earlier: 4 instructions.
So, can your code do it in 3 instructions or less?

If I understood correctly, this is what you'd need to do for projection alone:
cbuffer cb0
{
    float4 Values;
};

struct VS_INPUT
{
    float4 Position : POSITION;
};

struct VS_OUTPUT
{
    float4 Position : POSITION;  // vertex position
};

VS_OUTPUT main( in VS_INPUT Input )
{
    VS_OUTPUT Output;

    Output.Position.w   = -Input.Position.z;
    Output.Position.xyz =  Input.Position.xyz * Values.xyz;
    Output.Position.z  +=  Values.w;

    return Output;
}


This results in 4 instructions already, and this is not a complete shader, since there is no transform applied to take the input vertex from object space to camera space:

//
// Generated by Microsoft (R) HLSL Shader Compiler 9.29.952.3111
//
//  fxc ProjectionShader2.vsh
//
//
// Parameters:
//
//  float4 Values;
//
//
// Registers:
//
//  Name        Reg  Size
//  ------------ ----- ----
//  Values      c0      1
//

    vs_2_0
    dcl_position v0
    mad oPos.z, v0.z, c0.z, c0.w
    mul r0.xy, v0, c0
    mov oPos.xy, r0
    mov oPos.w, -v0.z

// approximately 4 instruction slots used


So why exactly did we want to do this again?

And while you're at it, answer my last post about clipping in the D3D pipeline thread as well.
Posted on 2012-09-16 03:59:32 by Scali
You can't just pass in a single WVP matrix unless you only have one object in your world. Therefore you are going to either pass what changed from the major transforms and multiply them in the shader, or multiply on the CPU and pass the result, which you are omitting to mention (or measure).

Here is the viewspace-to-clipspace transform, aka the projection matrix stage, without matrices (GLSL shader source code), based on the Vec4 calculated in the previous post.

This plaintext forms part of my surface shader generator, which takes in a pair of (vertexformat, material) semantics, generates appropriate shader source code, and compiles it. Only what is being used appears in the source, and shader programs can be shared.


"//perspective project\n"
"vec4 get_projection(vec3 v, vec4 pr)    {\n"
" return vec4( v.xy * pr.xy, v.z*pr.z + pr.w, -v.z);\n"
"}\n"

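For context, a complete vertex shader built around that helper could look like this sketch (the extra uWorldView uniform for the objectspace-to-viewspace transform is an assumption):

#version 120
uniform mat4 uWorldView;   // object space -> view space
uniform vec4 uProjValues;  // the Vec4 from UpdateProjectionTransform()
attribute vec4 aPosition;

//perspective project
vec4 get_projection(vec3 v, vec4 pr)
{
    return vec4(v.xy * pr.xy, v.z * pr.z + pr.w, -v.z);
}

void main()
{
    vec3 viewPos = (uWorldView * aPosition).xyz;
    gl_Position  = get_projection(viewPos, uProjValues);
}
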

And I challenge you again to show me how to multiply two matrices in 4 operations.

Posted on 2012-09-16 04:11:28 by Homer

You can't just pass in a single WVP matrix unless you only have one object in your world.


Ah, I see you know about as much about shaders as you do about clipping...
What nonsense is this? Especially since we've already discussed it earlier.
At any time during the execution of the shader, you are rendering a single object.
You simply update the matrices between objects. Or how exactly did you propose to do it? Stuff the matrices for all objects into the shader at once, and use some kind of index field in the vertices?

Therefore you are going to either pass what changed from the major transforms and multiply them in the shader, or multiply on the CPU and pass the result, which you are omitting to mention.


I omit to mention it because it's supposed to be common knowledge. It is also completely irrelevant.
Yes, you update shader constants between draw calls. That's the same for all shaders.

And as I said in the previous post, that code takes 4 instructions alone. So how is that better than the normal approach, where you have 4 instructions for objectspace to clipspace, rather than just viewspace to clipspace?


And I challenge you again to show me how to multiply two matrices in 4 operations.


I don't NEED to multiply two matrices. I fail to see the relevance of this question.

Also: answer the D3D Pipeline thread. Why do the clipped triangles have more vertices?
Posted on 2012-09-16 04:24:50 by Scali