Heya :)
I've taken much of the source code used in my OpenGL Tutorials and wrapped it into an ObjAsm32 class.
Attached is a beta of my OpenGL window class... This is what I'm using to power my zombie game.
New features include smoothed framerate capping... this file will be updated periodically.
Hope you find this useful :)
> New features include smoothed framerate capping...
Hi :) I have a question, if I may. Why capping the framerate instead of simply V-synch-ing? Is there any particular use for framerate cap without vsync?
Yes - the way VSync works, when you call SwapBuffers, the thread that called it will block until it's time to display the next frame.
This would normally force us to use other thread(s) to drive our physics, audio and other CPU-intensive stuff... and in turn this forces us to introduce mutexes to protect the renderable lists from being accessed by the render thread while they're being manipulated (we can't delete stuff safely unless we're certain we're not rendering at the time - and that means inter-thread communication is now required). So VSync caps the rate of execution of the calling thread to whatever the device sync rate is set to, by internally 'sleeping' (waiting on a timed, interrupt-driven event) to enforce our maximum update rate (we can still drop BELOW the sync FPS rate by taking too long between frame renderings!).
We're telling it no - never block - we can control the framerate ourselves and put that thread's idle time to good work (similar to the PeekMessage technique in WinMain). If we use that idle time to ALSO manage the lists of objects touched during rendering, we eliminate the need for mutexes here, along with many unnecessary context switches and other side effects caused by using an API in blocking mode.
We get to eliminate all the problems, and yet retain all the benefits.
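For reference, opting out of VSync on Windows OpenGL goes through the WGL_EXT_swap_control extension. A minimal sketch, assuming a current GL context (the typedef mirrors the one in wglext.h):

#include <windows.h>

typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

void DisableVSync(void)
{
    PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
        (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");
    if (wglSwapIntervalEXT)
        wglSwapIntervalEXT(0);   /* 0 = SwapBuffers never waits for the retrace */
}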
Anyone who wants to actually build the GLWindow object will need the attached high-performance timer class... I think that's the only dependency.
The GLWindow class has an embedded Timer which it uses to keep track of A) the total running time of the app (App Time) and B) the time elapsed between iterations of the main loop (Elapsed Time). These are made available via GLWindow variables for any purpose, but are used internally to implement the smoothed FPS capping scheme. It's possible to disable the FPS capping by simply requesting a capping rate that's too high to physically achieve; the overhead of the FPS calculation code is quite negligible - just a couple of FPU opcodes, and otherwise it's all code that we'd still want to run.
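To make the scheme concrete, here's a sketch of one plausible smoothed-cap main loop - NOT the actual GLWindow source. timer_elapsed(), render_frame(), do_housekeeping() and the 0.9/0.1 smoothing weights are illustrative assumptions:

extern double timer_elapsed(void);   /* per-iteration delta (Elapsed Time) */
extern void   render_frame(void);
extern void   do_housekeeping(void);

void main_loop(double cap_fps)
{
    double target   = 1.0 / cap_fps;   /* desired seconds per frame  */
    double accum    = 0.0;             /* time since the last render */
    double smoothed = target;          /* moving-average frame time  */

    for (;;)
    {
        accum += timer_elapsed();

        if (accum >= target)
        {
            /* the exponential average irons out single-frame spikes;
               smoothed FPS is available as 1.0 / smoothed */
            smoothed = 0.9 * smoothed + 0.1 * accum;
            render_frame();
            accum = 0.0;
        }
        else
        {
            do_housekeeping();   /* idle time put to good work, never Sleep */
        }
    }
}

Requesting an impossibly high cap_fps makes target smaller than any achievable frame time, so the render branch runs every iteration - which is how the capping effectively disables itself.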
You'll notice that the GLWindow class implements its own "WndProc" (handler for window messages).
If your application creates dialogs with their own WndProc handlers, or extends the handler in GLWindow, you WILL need to be careful about manipulating your lists of renderable entities in response to window messages, since the Render method is being called by another thread (the "Worker") - if you're not careful you will crash that worker thread (a big fat GPF on an illegal list access).
You'll need to somehow disable rendering of some or all entities while you're messing around with their collections / lists, or find another way to synchronize the rendering thread with the message-handling thread.
Perhaps I'll publish an update which addresses this issue. My SoundSystem class solves this exact problem by introducing a CriticalSection around the class's 'render' call, so that any attempt to manipulate the list while it's being accessed will block for the remainder of the current frame being rendered... It's far from elegant, but in that class I needed to give as much priority to the 'render' thread as possible - and isn't that what I'm trying to do again here, by disabling VSync etc.?
The beauty of this solution, if there is any at all, is that it protects ALL lists of renderable entities, even ones I have not invented yet, in a very centralized and authoritarian way - shall we call this "the big stick approach"? :D
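For illustration, a minimal sketch of that "big stick" arrangement, assuming Win32 CriticalSections; render_all_entities() and modify_entity_lists() are hypothetical placeholders:

#include <windows.h>

CRITICAL_SECTION g_renderGuard;        /* InitializeCriticalSection() at startup */

extern void render_all_entities(void); /* hypothetical: walks every renderable list */
extern void modify_entity_lists(void); /* hypothetical: adds/removes entities */

/* Worker thread: the whole frame render runs under the lock. */
void Render_Frame(void)
{
    EnterCriticalSection(&g_renderGuard);
    render_all_entities();
    LeaveCriticalSection(&g_renderGuard);
}

/* WndProc thread: any list manipulation blocks until the frame finishes. */
void On_ListChange(void)
{
    EnterCriticalSection(&g_renderGuard);
    modify_entity_lists();
    LeaveCriticalSection(&g_renderGuard);
}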
Perhaps a non-blocking list implementation (or at least a less-blocking one) would be more efficient than wrapping a time-sensitive call in a critical section.
Assuming the list is a linked list of renderable entities, adding an entity to the end of the list can be done without blocking, and removing an entity can be done by simply setting a flag on that entity. Then the render proc can check this flag and cull the list accordingly.
You could at least make the critical section smaller by putting it inside a branch for list culling. If no entities need to be removed from the list then the critical section won't be entered during that Render iteration. Something like:
/* Cull-and-render pass: the critical section is entered at most once
   per pass, and only if a dead node actually has to be unlinked. */
Node     = List_Head_Node;
IsLocked = false;
while( Node.Next != NULL )
{
    TempNode = Node.Next;
    TempObj  = TempNode.Value;
    if( TempObj.GarbageCollect == false )
    {
        TempObj.Render();
        Node = Node.Next;               /* advance past the live node */
    }
    else
    {
        if( IsLocked == false )         /* enter once, not per dead node */
        {
            CriticalSectionStart();
            IsLocked = true;
        }
        Node.Next = TempNode.Next;      /* unlink the dead node */
        TempObj.Dispose();
    }
}
if( IsLocked == true )
{
    CriticalSectionEnd();
}
*Note* I haven't delved into your implementation, so the above may be completely irrelevant.
> Perhaps a non-blocking list implementation (or at least a less-blocking one) would be more efficient than wrapping a time-sensitive call in a critical section.
> Assuming the list is a linked list of renderable entities, adding an entity to the end of the list can be done without blocking, and removing an entity can be done by simply setting a flag on that entity. Then the render proc can check this flag and cull the list accordingly.
Yes, I think it's a safe assumption in this situation that the render thread will only remove items from the list, and the other threads will only add items to the list. You can use that to your advantage.
Instead of a simple list, you could create a list of nodes where each node contains an array of renderables. Inside the array you can just advance the pointer with an InterlockedAdd() operation, much cheaper than a critsect.
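A minimal sketch of that chunked-array idea - RenderChunk and ChunkPush are hypothetical names, and InterlockedIncrement stands in for the interlocked add. Note the consumer must treat NULL slots as not-yet-published, since the index is claimed before the pointer is written:

#include <windows.h>

#define CHUNK_CAP 256

typedef struct RenderChunk {
    void *items[CHUNK_CAP];          /* renderable pointers, zero-initialized */
    volatile LONG next;              /* next free slot */
    struct RenderChunk *link;        /* following node in the chunk list */
} RenderChunk;

/* Producer side: claim a slot with one atomic increment - no critsect. */
BOOL ChunkPush(RenderChunk *c, void *renderable)
{
    LONG slot = InterlockedIncrement(&c->next) - 1;  /* returns the new value */
    if (slot >= CHUNK_CAP)
        return FALSE;                /* chunk full: caller links a fresh node */
    c->items[slot] = renderable;     /* publish last */
    return TRUE;
}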
Another approach is to have a thread generate a display list, and pass it to the rendering thread when the list is complete.
That way there is no concurrency, and locking is not required at all.
DirectX 11 actually supports a model like this directly through its interface.
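A sketch of how such a handoff might look with plain Win32 atomics (the D3D11 deferred-context API does this for you); CommandList, PublishList and TakeList are hypothetical:

#include <windows.h>

typedef struct CommandList CommandList;      /* opaque; built by the worker */

static CommandList * volatile g_pending = NULL;

/* Builder thread: publish a completed list with one atomic swap;
   returns whatever unconsumed list it displaced, for recycling. */
CommandList *PublishList(CommandList *done)
{
    return (CommandList *)InterlockedExchangePointer(
        (PVOID volatile *)&g_pending, done);
}

/* Render thread: take ownership of the latest complete list (or NULL). */
CommandList *TakeList(void)
{
    return (CommandList *)InterlockedExchangePointer(
        (PVOID volatile *)&g_pending, NULL);
}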
Of course, CriticalSections are not my preferred mutex, especially since we (OA users) already have our own lock-free mutex (thanks Biterider!) which can be applied per class-method invocation. That way only access to a particular method, object, class, or classes is mutexed, allowing other threads to continue doing what they were doing... much better than halting execution of all concurrent threads during the critical code section, and not subject to the race conditions that can occur with the flag approach (unless we are careful and use the LOCK prefix, etc.).
The OpenAL audio engine worker thread is a special case that required as much CPU time as possible under high loads - that example does indeed avoid the CriticalSection, and instead waits on a single event whenever there is NO workload - but it scales poorly in between.
Those problems are caused by OpenAL requiring us to use a POLLING method to refill streaming buffers.
Just how expensive is it to generate and destroy a big display list per frame? (OpenGL)
In my D3D/BSP engine, I used a deferred rendering technique where I collected renderables per frame into a list under each referenced texture... this way I could completely eliminate texture thrashing, and in that engine I was already forced to enumerate almost down to the triangle level, so it was no biggie to do that.
I didn't literally mean OpenGL's own display lists, because AFAIK most modern operations, such as using vertex buffers and shaders, are not stored in a display list. So those display lists have become pretty useless for modern applications.
But making your own list of drawing commands (make some kind of object that stores all the info required to draw a renderable item, so it can be fired off in one call) can be pretty cheap, depending on how much information you can prepare at application startup time and re-use. In most cases, a renderable mainly needs its transform matrix (or matrices, in the case of skinning) updated every frame. Geometry is usually static, or at least kept in the same VBO, so the reference to the VBO is static. Same for shaders, textures etc.
So effectively you'd mainly be building a list of references to static renderable objects. Which can be done pretty cheaply with a good concurrent container datastructure.
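A sketch of such a drawing-command record, assuming modern OpenGL with an extension loader for the GL 2.0 entry points; DrawCommand and SubmitCommand are hypothetical names, and vertex-attribute setup is omitted:

#include <GL/gl.h>   /* plus an extension loader (glext.h) for the GL 2.0 calls */

typedef struct DrawCommand {
    GLuint  vbo, shader, texture;  /* static handles, resolved at load time */
    GLfloat transform[16];         /* the only field rewritten per frame */
    GLint   first;                 /* static draw range */
    GLsizei count;
} DrawCommand;

/* Fire off one prepared renderable in a single call; mvpLocation is
   assumed to be cached by the caller. */
void SubmitCommand(const DrawCommand *dc, GLint mvpLocation)
{
    glUseProgram(dc->shader);
    glBindTexture(GL_TEXTURE_2D, dc->texture);
    glBindBuffer(GL_ARRAY_BUFFER, dc->vbo);
    glUniformMatrix4fv(mvpLocation, 1, GL_FALSE, dc->transform);
    glDrawArrays(GL_TRIANGLES, dc->first, dc->count);
}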
As for critical sections... you don't halt execution of all threads, only the ones that are also trying to lock the critsect at that time (first come, first served).
The main problem is that trying to lock a critsect can be pretty expensive (although less expensive than a mutex... the main difference is that a CS is process-wide while a mutex is system-wide - basically, don't use a mutex unless you need to sync multiple processes). Under contention it ends up waiting on a kernel object, after all. So if you can avoid it, you should.
Yeah, when I say 'mutex', I refer to mutual-exclusion mechanisms in general - not the Windows kernel one. Ours is based on the LOCK CMPXCHG opcode, IIRC. ;) And yeah, I realize that one CriticalSection won't cause other threads to stall - but try to use CSes to solve general inter-thread exclusion and you quickly end up with threads interlocking due to poorly timed accesses to mutexed resources - I've made such mistakes and learned from them.
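A sketch of the kind of compare-exchange spin mutex being described - MethodLock and MethodUnlock are hypothetical names; InterlockedCompareExchange compiles down to a LOCK CMPXCHG:

#include <windows.h>

static volatile LONG g_methodLock = 0;   /* 0 = free, 1 = owned */

void MethodLock(void)
{
    while (InterlockedCompareExchange(&g_methodLock, 1, 0) != 0)
        YieldProcessor();                /* PAUSE: spin politely */
}

void MethodUnlock(void)
{
    InterlockedExchange(&g_methodLock, 0);   /* release with a full barrier */
}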
You're dead right about the Render thread being in charge of removing dead wood, and other threads being responsible for list insertions.
Yeah, OK - not real display lists, but custom lists of polytope indices (which we can then rattle off into an index buffer, generally)... that's exactly how that old BSP engine I mentioned worked - I was able to quickly locate visible "buckets of triangle indices" referencing into static VBs, and batch off IBs for the current frame.
I don't really feel like having this discussion again... but locking an indexbuffer, (re)writing it and then rendering it is also a huge concurrency problem - this time between the CPU and the GPU.
In most cases it's cheaper to just render the full indexbuffer than it is to try to be smart and render only the visible triangles. A GPU is THAT much faster than a CPU.
Which is why leafy BSPs were invented... That way you don't have to rewrite your indexbuffers, you do things per-indexbuffer rather than per-triangle. Which means that the CPU doesn't have to touch the GPU's memory, and there's no concurrency problem.
But even that is becoming increasingly inefficient as you need larger and larger indexbuffers to keep the GPU happy (it eats through them at alarming rates, millions of triangles per second).
In fact, these days we even use the GPU to render an entire bounding box and count the pixels, because brute-forcing it that way is faster than letting the CPU determine visibility with a geometry-based solution. Rendering millions of pixels is faster than testing a dozen geometric shapes.
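In OpenGL terms that's an occlusion query (GL 1.5); a hedged sketch, where DrawBoundingBox() is a hypothetical proxy-box renderer, and a real engine would collect the result a frame later rather than stall on it:

extern void DrawBoundingBox(void);   /* hypothetical: renders the proxy box */

GLboolean BoxVisible(void)
{
    GLuint query, samples = 0;

    glGenQueries(1, &query);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);  /* no color writes */
    glDepthMask(GL_FALSE);                                /* no depth writes */

    glBeginQuery(GL_SAMPLES_PASSED, query);
    DrawBoundingBox();
    glEndQuery(GL_SAMPLES_PASSED);

    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    glGetQueryObjectuiv(query, GL_QUERY_RESULT, &samples); /* blocks until ready */
    glDeleteQueries(1, &query);

    return samples > 0;   /* any pixel passed: potentially visible */
}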
In short: keep your geometry static, unless you really REALLY have to rewrite data. It can literally mean orders of magnitude faster rendering.
Use coarse visibility determination, with bounding volumes around movables, on top of pre-built static rendering data.
By the way, MS has obviously also optimized their synchronization objects... They use a spinlock on the first try, and put the thread to sleep only after that (if you don't, you're just eating up a core, preventing it from running other threads in the meantime - another concurrency problem)... But even so, there's overhead involved in sleeping threads and signaling them to wake again. I doubt you can do better by rewriting it manually, really.
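That spin-then-sleep behaviour is even tunable per critical section: with a nonzero spin count, EnterCriticalSection busy-waits that many iterations before falling back to a kernel wait. The 4000 below is the value the Win32 docs cite for the process heap lock - the right number depends on your contention pattern:

#include <windows.h>

CRITICAL_SECTION g_listLock;

void InitLocks(void)
{
    InitializeCriticalSectionAndSpinCount(&g_listLock, 4000);
}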
> I didn't literally mean OpenGL's own display lists, because AFAIK most modern operations, such as using vertex buffers and shaders, are not stored in a display list. So those display lists have become pretty useless for modern applications.
AFAIK, they are very efficient for quick state transitions (something like D3D's state blocks). I remember at least one Nvidia doc recommending them for this purpose.
> AFAIK, they are very efficient for quick state transitions (something like D3D's state blocks). I remember at least one Nvidia doc recommending them for this purpose.
True, they're still a good solution for situations where you can use them.