(Only works in compatibility mode for XP, else it complains that there is not enough memory (probably uses a signed 32-bit integer, then it sees a VERY large amount of memory from within my 64-bit environment))

Is there a complete list of all things that change when you run apps in each of the compatibility modes? Google wasn't very helpful.
Posted on 2010-08-21 11:32:00 by ti_mo_n

Because you spent time writing the code? :)

Yea, good point...
It's even worse because the code is in an svn repository anyway. If I ever wanted it back, it would be easy.
But yea, there are a number of reasons why I could hang on to the D3D10 code... none of them are very good reasons, but still :P
Posted on 2010-08-23 03:01:17 by Scali

(Only works in compatibility mode for XP, else it complains that there is not enough memory (probably uses a signed 32-bit integer, then it sees a VERY large amount of memory from within my 64-bit environment))

Is there a complete list of all things that change when you run apps in each of the compatibility modes? Google wasn't very helpful.

I've never seen one... I think it's just one of those things that Microsoft doesn't document.
This blog says pretty much the same as what I usually do though, trial and error:
The question of what compatibility mode to choose is mostly a matter of trial and error. The general rule is to choose the last version of Windows with which you knew the application worked. Sometimes the application will actually tell you what version of Windows it was written for and is expecting. In general, I wouldn't change the display settings until you've tried all the available compatibility mode settings and are still having corrupted video or the window fails to appear. The least intrusive video change is to disable visual themes — try that first. Often if an application's windows aren't appearing or are distorted, this will resolve the issue. Next I'd try the 256 color option, and only resort to 640x480 if you must. Who wants to run their computer at that resolution with today's larger monitors and more powerful video cards?

So I generally just start at XP SP3, and then work my way down.
Here's some interesting alternatives though, like having Win7 troubleshoot the compatibility automatically, or hacking it in manually via regedit (also nice to check what settings it currently uses, and disabling it completely): http://www.sevenforums.com/tutorials/316-compatibility-mode.html
Posted on 2010-08-23 03:06:53 by Scali

I suppose theoretically I could convert my MFC code to .NET WinForms. That way it will be 'VS Express'-compliant.

Well, for the VJ-tool I've been working with .NET... The main GUI will be made with C#. So I figured I'd try to wrap my engine code in a managed C++ wrapper, so that it can be called from C# easily.
It seems that I've done a good job earlier when I split my engine up into a loader executable and an MFC dll. Namely, the window-handling code was separate from the main engine class and all the actual D3D functionality. The engine only required a HWND and a HINSTANCE, but nothing MFC-specific.
So I could very easily split up the code into a static library containing the main engine functionality, and the MFC window/application code.
The MFC DLL now uses the static library to wrap a window around it, and handle basic operations such as resizing.

Then I built another DLL, this time using managed C++ instead. I basically wrapped a managed .NET interface around the existing C++ code. Then I used this interface in a C# application, so the engine now runs inside a .NET form, rather than in an MFC application/window.

So that part went quite smoothly so far. Technically I have now probably achieved what I mentioned earlier: The engine can be used in an MFC-free environment, and should be able to work from Visual Studio Express as well.

Next up is to expand the interface, so that the C# code will get more control over the engine, so that we eventually can use it for the VJ-tool and demo editor in C#.
Posted on 2011-01-09 15:00:38 by Scali
Well, I decided to put my theory to the test... I discovered a bug in the D3D9 code on XP. Apparently the reset doesn't work because one of the dynamic resources is not destroyed properly. Not too surprising, as I have refactored a whole lot of code, and made things a lot more reference-counted than before. The D3D-code is now much more similar to the OpenGL code. But apparently I missed something there.

I cannot reproduce this problem on Vista/Windows 7 (because of the new driver model, devices don't actually get lost anymore, and the reset is never triggered), so I have to debug it on XP. Since I hadn't updated the Visual Studio on my XP installation since 2005, I had to install a fresh copy of 2010 there. I decided to try and kill two birds with one stone, so I installed the Express version instead of the Professional that I normally use.

Anyway, although theoretically my main engine library is now completely MFC-free, obviously things didn't work that smoothly in practice yet. I was still using the Afx headers rather than vanilla Windows headers, and those don't exist in an Express installation. Likewise, it was also set up to link against the MFC libraries rather than the vanilla C++ libraries, which don't exist in an Express installation either. Sure, you can download and install the Windows SDK for free, which contains the required headers and libraries... but that sorta defeats the point. Instead of making your code work in Express, you're just converting your Express installation to a full-blown Visual Studio. Besides, Express doesn't contain the resource editor and MFC wizards, so although the code will compile, you cannot edit the MFC windows in the IDE anyway. So what's the point?

So I decided to fork my current engine into an Express branch. And then I removed all the remnants of MFC which were still present. It was mainly the A2W() and related macros that were causing the actual trouble. For the rest it was just some project settings, and a few #include statements that needed replacing, or in one case should just be removed altogether.
I decided to just make my own clones of the A2W()-etc macros, since they are pretty useful. Once that worked, I could compile the engine in VC++ Express. The managed C++ wrapper could also compile smoothly now.
The C# code just worked as-is. There seems to be a lot less disparity between the Express C# and the full Studio C# environment than in the case of C++.

So anyway, I have now achieved my goal of being compatible with Visual Studio Express, with my D3D engine.
I also gave the OpenGL engine a whirl, and that went extremely smoothly... Not too surprising, as it uses Glut to take care of the windows, and it does not use any kind of Unicode or anything. Aside from that, the code itself was already multi-platform, and could work in other IDEs, on other OSes.

It's always a good test to try and get your code compiling from scratch... and changing IDEs/compilers will often reveal some small bugs or other messy things, which would have probably gone unnoticed forever otherwise. My code is now a bit cleaner, some bugs have been fixed, and the code is less dependent on things that it should not depend on.

I haven't actually gotten round to fix the reset-problem though, but that's next on the list. After that, I will be trying my hand at wrapping things the other way around... passing a managed C++ object to the C++ code. That is how my setup dialog currently works. I want to recreate that dialog in managed C++ or even C#, and pass it to the engine. Mostly as a proof-of-concept at this point, because the concept of being able to pass .NET objects to native C++ code will be quite useful for the VJ-tool.

After that, I will be modifying the animation code in the engine. Currently all animation is assumed to be based on a set of rotation and position keyframes. That is not very practical for realtime user interaction. So I will encapsulate the animation into a new interface, so I can plug any type of animation into any object. And such an animation controller can also be passed down from the C# code, so not everything will necessarily have to be done in C++.

Posted on 2011-01-24 03:07:34 by Scali
Just so you're aware, you can still download 2008 'express edition' as a full version (ie, it won't expire like the 2010 demo will).
Yeah I went through the same stuff with regards to not correctly releasing everything, it was a pain to track down the culprits.
That led me to implement centralized resource managers for various resource types, with reference counting as you mentioned.

In regards to keyframed animations not being flexible, this is exactly how D3D's animation track event queue works, what's wrong with it?
You can inject whole keyframes, or specific 'track events' (such as changing the blend weight, speed, etc) into an existing animation track, or more likely, into a 'spare' track designed to be blended with the other(s). It does allow for a lot of control, although it would be nice to see decent animation blend editors that cater to this scheme.

Posted on 2011-01-24 03:38:36 by Homer

Just so you're aware, you can still download 2008 'express edition' as a full version (ie, it won't expire like the 2010 demo will).

As far as I know, the 2010 version will not expire either. It just asks you to register within 30 days (which is free).

Yeah I went through the same stuff with regards to not correctly releasing everything, it was a pain to track down the culprits.
That led me to implement centralized resource managers for various resource types, with reference counting as you mentioned.

Well, I use the managed pool whenever possible. But for dynamic resources, I have created my own interface with an InitResources() and DestroyResources() method.
Any class that needs to be able to deal with a reset, will implement this interface, and then register itself with the engine.
When the engine resets, it will then call DestroyResources() for all registered objects, reset the device, and call InitResources() so that all dynamic resources are restored, and execution can continue transparently.
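The scheme above can be sketched roughly like this; a minimal illustration with assumed names (IDynamicResource, Engine, CountingResource are illustrative, not the actual engine's classes), with the actual device Reset() call stubbed out:

```cpp
#include <vector>
#include <cassert>

// Sketch of the reset-survival interface described above (assumed names).
struct IDynamicResource {
    virtual ~IDynamicResource() {}
    virtual void DestroyResources() = 0;  // release dynamic (default-pool) objects
    virtual void InitResources() = 0;     // recreate them after the device Reset()
};

class Engine {
    std::vector<IDynamicResource*> listeners;
public:
    void Register(IDynamicResource* r) { listeners.push_back(r); }

    // On a lost device: tear everything down, reset, rebuild,
    // and execution can continue transparently.
    void HandleDeviceReset() {
        for (IDynamicResource* r : listeners) r->DestroyResources();
        // device->Reset(&presentParams) would go here in the real engine
        for (IDynamicResource* r : listeners) r->InitResources();
    }
};

// Example implementer, used to verify the call sequence.
struct CountingResource : IDynamicResource {
    int destroyed = 0, inited = 0;
    void DestroyResources() override { ++destroyed; }
    void InitResources() override { ++inited; }
};
```

A reset then calls both methods exactly once on every registered object, in the order described.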

Problem is, while refactoring a lot of my code, I have apparently broken some of this stuff. I guess the reference count of one or more objects will never reach 0, so the Reset() fails.
A helpful tool is this simple thingy:
UINT GetRefCount(IUnknown* pUnk)
{
    // AddRef()/Release() pair: Release() returns the new count,
    // so this reads the reference count without changing it.
    pUnk->AddRef();
    return pUnk->Release();
}

I guess the problem I run into is that some resources may be shared with other objects, so just calling a single Release() from DestroyResources() will not do the trick.
The indirect problem is more difficult. I can just call Release() until the count is 0, and then Reset() will work. But this means that some objects still have a reference to the old object. If I create a new object in InitResources(), these references need to be updated to point to the new value. I will probably have to wrap a layer around these objects, so that the reference can point to this layer, and the actual resources can be updated internally without a problem.
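That 'layer' idea might look something like this; a hypothetical sketch (TextureHandle, Texture are made-up names): holders keep a pointer to the wrapper, while DestroyResources()/InitResources() swap the underlying object inside it, so nobody's reference goes stale.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for a D3D resource; purely illustrative.
struct Texture { int id; };

// The 'layer': other objects reference the handle, never the raw
// resource, so the resource can be destroyed and recreated on reset
// without every holder having to update its pointer.
class TextureHandle {
    Texture* tex = nullptr;
public:
    void Set(Texture* t) { tex = t; }  // called from InitResources()
    void Clear() { tex = nullptr; }    // called from DestroyResources()
    Texture* Get() const { return tex; }
};
```

After a reset, the handle simply points at the freshly created resource; all shared references remain valid throughout.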

In regards to keyframed animations not being flexible, this is exactly how D3D's animation track event queue works, what's wrong with it?
You can inject whole keyframes, or specific 'track events' (such as changing the blend weight, speed, etc) into an existing animation track, or more likely, into a 'spare' track designed to be blended with the other(s). It does allow for a lot of control, although it would be nice to see decent animation blend editors that cater to this scheme.

Well, as I say, for realtime user interaction it's not that nice.
What if you just want to have mouse control on your object? So that the user can rotate the object directly with the mouse?
Sure, you could delete the animation track for every mouse event, and put a new 'animation track' in there, which does nothing more but reflect the current mouse position... but it can be much simpler. Namely, if I just have an alternative AnimationController which simply takes some rotation parameters and generates the rotation matrix on-the-fly, it's much easier to interact with the mouse directly. Because that's what you want when you're VJing... you want to manipulate objects directly.
I can think of various other examples where keyframes may not be the ideal solution (for example, having the music control various animation parameters via an FFT function), but you get the idea, I suppose?
Posted on 2011-01-24 04:12:48 by Scali

Yeah certainly - replacing the default animationcontroller/mixer is certainly one path.
The other is to create specialized controllers 'on top' of D3D's one.
Most of the game engines I've looked at recently feature one or more 'character controllers' which can be quite generic ('biped controller') or more specialized ('zombie npc controller'), typically implementing code which throws commands to the default animation mixer.
I guess these examples are all really FSMs, eating a finite input set.
This includes inverse kinematics stuff too ('biped stair climber'), since we have a good idea of the 'target state'.

Your FFT example is a really good candidate for bypassing the default mixer though, since the input is not finite, a FSM approach seems unsuitable indeed.

Finally, writing your own mixer allows you to go cross-platform, given that OpenGL has no similar functionality... whereas building on top of D3D's mixer ties your engine to Windows forever.

Posted on 2011-01-24 09:30:36 by Homer
I never use any of D3D's animation code in the first place.
Posted on 2011-01-24 10:57:45 by Scali
Ah, the problem was not even related to the resource handling in the first place.
It kept resetting because I set an invalid light. Apparently I couldn't be bothered to define a complete lightsource, so I only filled out whatever my shaders required.
Problem is, in D3D9, the light actually gets sent to the fixed function pipeline, and it doesn't like it. Which triggers a reset.
So, fixed up the light, et voila, it works in XP/D3D9.
As a bonus, it should now also run without shaders, on my ancient laptop :)
Posted on 2011-01-24 12:43:06 by Scali
In the past two weeks I've been renovating my ancient laptop. It's a Celeron Northwood 1.6 GHz with 512 MB of memory, a 20 GB HDD, and Windows XP Home. The video chip is an ATi Radeon IGP340M, a derivative of the Radeon 7000-series, which is DirectX 7-class, and has OpenGL 1.3 support.
Initially it ran like a dog, because the HDD was nearly full, and over the years, the service packs and updates had installed a lot of stuff that you really don't want on such a low-spec machine.
So I decided to clean up the HDD as much as possible. I found this very cool tool for it: Free Disk Analyzer.
I managed to free up nearly 7 GB, and the machine's performance started to improve. It still used a lot of memory though, even when completely idle. One thing that really bogged down performance was the automatic checking for Windows Updates. It takes a lot of memory and keeps the HDD occupied for a long time, when you first boot up the system. So I disabled that. I'll just manually check for updates via the website from time to time.
Another thing is Security Essentials. I decided to disable that completely (not just disabling realtime scanning, but removing the services from memory altogether), and just run a manual scan from time to time.

Now the machine only takes about 200 MB on a fresh boot, and it's quite quick to start and be ready for use, since it doesn't have to start all these services.
So, I decided to go crazy and put Visual Studio 2010 Professional on it. This was the initial goal for the renovation anyway: I wanted to test my OpenGL code on it. And I have to say, it works better than I expected. It takes quite a long time to start up, but after it's loaded, it doesn't even take up that much memory. The total memory use is still under 400 MB, very different to how it behaves on machines with a lot more memory.
The machine is very slow to compile, but in the end everything DOES work, which is pretty cool.

And I mean that very literally: after a few small fixes, my OpenGL code worked on the thing! Replaying the Croissant 9 content quite nicely. In fact, the framerate in this case wasn't that far off from my newer Core2 Duo laptop with a DX10-spec Intel IGP.
Since I had fixed the XP/D3D9 bugs in my D3D code, I decided to give that a try as well. I couldn't get myself to cut out the fixed function support from the D3D9 codebase, so I had left that in the current D3D9/10/11 framework. As a result, I could easily get it working on the fixed function pipeline of this ancient DirectX 7-class IGP.

It wasn't even that bad either. It didn't quite reach the full framerate of the 720x576 test movie that I used on my DirectShow texture, but it wasn't far off, and everything worked. So that was pretty cool. With a lower-resolution movie (VJs tend to use around 400x300 anyway; it's not all that high-res. If we pull off full HD with this project, we'd actually be groundbreaking in a way), this machine may actually be useful.

That was part of the idea really... "How low can you go?" One idea is to have a few computers networked together, each computer driving one projector. By making the code as efficient as possible, we can really cut costs on the hardware. If we can coax enough performance out of simple IGPs, we could use cheap barebone systems, or perhaps even old surplus hardware.

At this point it appears that the CPU is a more important factor than the IGP. The laptop seems to be let down by the single-core Celeron, which has trouble doing both the video decoding and the 3D rendering at the same time. On a dual-core machine, the 3D rendering and video decoding are almost completely independent of each other, and run nicely in parallel, with the 3D renderer reaching virtually the same framerates as with static textures. So it seems that dual-core or quad-core systems with an Intel IGP would be sufficient, whereas systems with a single-core CPU but a high-end GPU may not cut it (I could dust off my Athlon XP 1800+ with Radeon 9600XT to test that theory. It's probably slower at video decoding than the Celeron because it lacks SSE2; the videocard, on the other hand, is much more powerful).

Aside from that, I've also been busy hacking the animation code out of my framework, and introducing a new AnimController system. The result is that the code is even cleaner now, and I have full control over the animation. As a simple test, I wrote an AnimController which takes two angles, and builds a rotation around the X and Y axis from those. Currently I just feed it some angles derived from the timer, but the next step is to implement an AnimController in C#, and take mouse input, so you can manipulate the rotation in realtime.
Posted on 2011-01-26 04:12:55 by Scali
Right, the C# AnimController works like a charm now.
I have to amend my previous post though: the estimates for IGP performance were based on the test resolution of 640x480. If I want to render at 1280x720 or 1920x1080, then my Intel IGP struggles to sustain a decent framerate, no matter what you render (or don't render; just clearing the screen and zbuffer is difficult enough). So although the streaming to texture seems mostly dependent on CPU performance, driving an HD resolution output needs something more than a basic IGP, at least at this point (I'm sure Intel's latest CPUs with integrated GPUs will be good enough, same for AMD's variations).
Posted on 2011-01-27 05:44:27 by Scali
Okay, I've been busy with dynamic textures for the most part.
I have implemented a nice base class which does the grunt work for updating a texture with new pixels, and handling the destruction and recreation required for window resizing and all that.

Then I have created a subclass using DirectShow, which you can use to stream a video file to a texture, or to capture live data from a camera or other capture device (eg a TV card).

Since we have also added support for hosting Flash controls in our UI, I was wondering how difficult it would be to host a Flash control on a texture.
As it turned out, there doesn't appear to be any kind of support for capturing frames from Flash. However, the Flash OCX *is* a Windows control. Which means it has a window handle. So I figured: if it has a hwnd, I can probably get its DC, and capture it in a bruteforce way: just BitBlt its contents to a DIB, et voila, I can access the pixels.
So I first created a simple proof-of-concept application, which had two panels. The left one hosted the Flash control. The right one would make a copy via DIB and render the DIB.
It worked, so there we have it. Next step was to actually put the pixels into the D3D texture, but that was just a formality at this point, already having the working base class from the DirectShow stuff.
The fun part is that it is a very generic and bruteforce approach: the D3D-related code doesn't know anything about Flash. It just gets a window handle, and does its BitBlt magic. I can capture any window on the screen, and host it on a texture. Even the desktop itself.
The performance for Flash isn't even that bad. I've added an optimization that checks the CurrentFrame property of the Flash control before rendering each frame, and only updates the texture when there is a new frame, to avoid a lot of redundant copies (the D3D renderer runs at WAY higher framerates than your average Flash movie).

The beauty of it all is that both DirectShow and Flash are COM objects, which can run in their own threads. This means that the D3D renderer can just render as fast as it wants on the first CPU core, while the DirectShow or Flash decoding/rendering can take place on the other cores of the machine independently. The renderer only needs to update the texture whenever a new frame is completed by another core (you cannot update the texture from another thread because it cannot be updated while being used for rendering, and the D3D objects don't like to be used in more than 1 thread anyway). So it is a very asynchronous process by nature, and a very good case for multi-core systems. Basically the more cores you have, the more dynamic content you can host on textures.
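The polling scheme described above can be sketched in a few lines; a minimal illustration with hypothetical names (FrameSource is made up), using an atomic frame counter so the decoder thread and the render thread never need a lock:

```cpp
#include <atomic>
#include <cassert>

// Sketch of the asynchronous scheme described above: a DirectShow/Flash
// thread completes frames on its own core, and the render thread polls a
// counter and only touches the texture when a new frame actually exists.
struct FrameSource {
    std::atomic<int> frameCounter{0};  // bumped by the decoder thread
    int lastUploaded = 0;              // only touched by the render thread

    void FrameCompleted() {            // called from the decoder thread
        frameCounter.fetch_add(1, std::memory_order_release);
    }

    // Render thread: returns true if the texture should be updated now.
    bool NewFrameReady() {
        int current = frameCounter.load(std::memory_order_acquire);
        if (current == lastUploaded) return false;
        lastUploaded = current;
        return true;                   // Lock/copy/Unlock the texture here
    }
};
```

The render loop calls NewFrameReady() once per frame and skips the texture upload entirely when nothing changed, which is most of the time at high render framerates.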
Posted on 2011-02-14 06:39:52 by Scali
I decided to also look into GDI interop.
I recalled that D3D9 surfaces supported a GetDC() call, after which you could just use any GDI function on it. So rather than first BitBlting to a DIB, and then locking, copying pixels to texture, and unlocking, I could just BitBlt them straight onto the texture.

I tried that, initially it actually seemed a tad slower than my bruteforce approach, but then I noticed that the framerate slowly picks up, and after running for a few seconds it seems to settle at pretty much the same speed as the DIB implementation.

Then I decided to look into what D3D10/11 can do... I knew that D3D10 removed the GetDC() functionality... However, with Direct2D and all, and the updated DXGI layer for D3D11, they have reintroduced the functionality. And even nicer, it works on both D3D10 and D3D11 now (the beauty of COM interfaces: just query for a given IID, you never know if it may be supported :)).

So now I have made the implementation slightly nicer by using GDI interop in all three APIs. GDI interop may come in handy for other stuff in the future as well.
Posted on 2011-02-14 10:31:45 by Scali
After a bit of a struggle, I now have mipmapping working as well, in all APIs.
In D3D9, automatic mipmapping worked okay after a GDI BitBlt, but not after Lock/Unlock and manual updating (well, it generated the mipmaps on the first update, but not on any update after that).
In D3D10/11, you could not enable automatic mipmapping at all, either on a dynamic texture or on a GDI-enabled texture.

So I used a simple workaround:
My dynamic texture will be just a single level texture, no mipmapping.
Then I create another texture of the same size, but without requiring CPU access/GDI interop, and enable mipmaps on there.

After updating my dynamic texture, I copy its contents to the largest mipmap of the second texture, and then trigger the mipmap generation.
Then I render using the second texture, et voila, I have lovely mipmaps, so I can get silky smooth anisotropic filtering, even on DirectShow movies, video captures, or Flash animations.
And since it's all GPU-accelerated, the copy and generation of mipmaps is virtually free. You actually save some performance since the texture fetching is far more efficient with the proper mipmap, rather than the 1024x1024 uber-texture that I used.
Posted on 2011-02-14 16:11:21 by Scali
This project is a management nightmare in a way :)
I currently support 3 rendering APIs, two programming languages, two architectures (x86 and x64), and three OSes (XP, Vista, 7). And that's not even including the different hardware it can run on (I use an Intel X3100 IGP on my laptop, GeForce GTX460 on my desktop, and my programming buddy uses a Radeon 4000-series).

This causes a few surprises here and there.
For example, I found that my trick with BitBlt() works differently in XP than it does in Vista and Windows 7. This probably has to do with how Vista and Windows 7 use Direct3D internally, and render all windows onto a texture, where clipping is not done at GDI drawing level, but by the GPU's zbuffer.
This meant that BitBlt copied the entire window contents in Vista/7 (even if you moved the window outside the viewable desktop area), but in XP, it would copy the screen area verbatim. Which means that overlapping windows would also show up in the copy (which I'm not sure is correct behaviour... I only use the SRCCOPY flag, and the CAPTUREBLT flag seems to exist specifically to capture all windows on top, so not specifying it would imply that they are not captured), and if the window was partially off-screen, that area was black, because GDI clipped it.

So, I decided to look for an alternative. I found that Microsoft has added the PrintWindow() function in XP to make a complete copy of a window, without having to worry about windows on top, or it being partially off-screen.
This worked nicely in XP... however in Vista, it DOES clip the window... but only on my 32-bit Vista machine with Intel IGP. Which could be either the 32-bit version not working exactly like the 64-bit version, or the Intel driver doing things differently from the nVidia driver. Oh well, since the BitBlt method works okay in that scenario, I have a workaround. I guess I will have to add a switch so the user can choose the method that works best.

Another few peculiarities I ran into, were related to how D3D9 responds under the different OSes.
One thing was that under XP x64, the D3D9 runtime resets the device when I specify an invalid light (I mainly set the light so I can read back some info in the shaders; I don't need a completely valid lightsource)... but it seems that it only does that in 64-bit. The 32-bit build didn't seem to have that problem, and just rendered on, without resetting. Same on vanilla XP 32-bit.

Another thing is with texture formats. For the GDI interop (and rendertargets in general), D3D9 doesn't want an alphachannel, so you want an X8R8G8B8 format rather than A8R8G8B8. But, only XP seems to actually enforce that. Vista and 7 just accept an alphachannel just fine (and that half makes sense, as D3D10+ are the opposite: only fully specified formats, with alpha, can be used as rendertarget).

And now I found that in Vista 32-bit on my laptop, I get a problem that the B8G8R8A8_UNORM format exists in DXGI, but is deprecated, so I cannot use it for a rendertarget. Weird. So that means I'd have to use an R8G8B8A8_UNORM format instead... HOWEVER, that format has the *exact* reverse pixel configuration to the X8R8G8B8 format in D3D9. So I have to be careful to only use it on rendertargets that I never actually access with the CPU, otherwise all my colours will be reversed. Ugh.
Posted on 2011-02-17 07:13:22 by Scali
Right, since we will be requiring quite a bit of post-processing, I went and wrote some code to render to a fullscreen quad first. In D3D10+, you NEED to use a vertexshader at all times, you can't just pass XYZRHW coordinates directly to the rasterizer stage anymore. Even if you wanted to pass XYZRHW coordinates, you would have to create a passthrough shader to make them work.

Anyway, it's a bit of a blessing in disguise, since I can now use a fullscreen quad that is -1...1 size, and it will be stretched to fullscreen automatically by the viewport projection step that is done after the vertexshader. This means I can just use a static vertexbuffer, no need to update it when the screen size changes.
And since D3D10+ has a 1:1 pixel:texel mapping, I don't need to do any kind of corrections either.
For D3D9 backwards compatibility though, I have added some shift factors to the vertex shader, so that it can shift the positions to map the texels to pixels (which is relative to screen space, now that you have normalized coordinates in the shader). Still nicer than updating the vertexbuffer itself, like I used to do (then again, that code is from DX7 days, so back then I didn't really have a choice. I just never bothered to rewrite it with shaders).
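For reference, the shift factor itself is just this; a sketch with assumed names (HalfPixelShift is made up), expressing the D3D9 -0.5 pixel screen-space correction in clip-space units (clip space spans two units across the viewport, so half a pixel is 1/size, with y flipped):

```cpp
#include <cassert>

// The constants a D3D9 vertex shader would add to the post-projection
// position to align texels with pixels. Illustrative names, not the
// engine's actual code.
struct ClipShift { float x, y; };

ClipShift HalfPixelShift(int viewportWidth, int viewportHeight) {
    return ClipShift{ -1.0f / viewportWidth,    // half a pixel left
                       1.0f / viewportHeight }; // half a pixel up (clip y points up)
}
```

These values depend on the viewport size, so they are passed to the shader as constants rather than baked into the static vertexbuffer.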

Once the basics of rendering a fullscreen texture worked, I also worked on render-to-texture, and then I made my own StretchRect()-clone. That was a very useful function in D3D9, but since it was basically just a wrapper around some standard D3D calls, Microsoft decided not to supply one for D3D10+. You have to roll your own now.
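What a StretchRect-clone has to do is easy to state as a CPU reference; this is an illustration only (made-up name, nearest-neighbour sampling), whereas the real implementation renders a textured quad into the destination rendertarget and lets the sampler do the filtering:

```cpp
#include <vector>
#include <cassert>

// CPU reference for a StretchRect-style copy with scaling:
// every destination pixel maps back to a source pixel.
void StretchRectCPU(const std::vector<int>& src, int srcW,
                    std::vector<int>& dst, int dstW, int dstH) {
    int srcH = (int)src.size() / srcW;
    for (int y = 0; y < dstH; ++y)
        for (int x = 0; x < dstW; ++x) {
            int sx = x * srcW / dstW;      // nearest-neighbour sample
            int sy = y * srcH / dstH;
            dst[y * dstW + x] = src[sy * srcW + sx];
        }
}
```

On the GPU the same mapping falls out of the texture coordinates on the quad, so the 'roll your own' version is mostly a matter of setting up the right viewport and quad.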

With this code in place, I could now do some nice post-processing. One of the effects I wanted to do was a boxfilter. I first started with a naive pixelshader version, which simply sampled the surrounding texels and averaged them. This caused a lot of redundant texel writes, and proved to be very inefficient for larger filter kernel sizes.

Back in the DX7 days I had developed an implementation, exploiting the fact that a boxfilter is separable into horizontal and vertical passes, and using the two texture stages of my GeForce2 at the time (those were the days) to build a filter with logarithmic complexity.
I would set the same texture to both stages, and align them so that I could average two samples at each pass.
For example:
Pass 1 produces averages of pixels: (0+1), (1+2), (2+3), (3+4) etc.
Pass 2 uses these factors to generate: (0+1+2+3), (1+2+3+4), (2+3+4+5) etc.
In short, a filter of size N in one direction takes log(N) passes. A filter of NxM would take log(N)+log(M) passes.
This turned out to be extremely efficient, even back in those days. You could run extremely large filter kernels, blurring the screen beyond recognition in realtime.
I still have an old demo online which used this boxfilter effect, see here. If you still have an old box with a GeForce2 or similar card, you should try it, to see just how fast it still is.

Anyway, I have now ported that routine over to the D3D10+ framework. The code always had a deficiency though, and that is that it only supported kernels with sizes that are powers-of-two, because of the way it works. There is a solution to that, and it is actually not all that difficult: if you use an extra accumulator rendertarget, and factor your kernel size into powers-of-two, you can accumulate all the terms as you go through the passes, and support any kernel size.

I've decided to do some other experiment though, before implementing this improvement on this boxfilter algorithm. And that experiment was (drum roll): Compute shaders!
Yes... there was something I wanted to do for a while now. My Java engine used a boxfilter based on the Summed Area Table algorithm (SAT). It consists of two passes. First pass builds a table with the sums of all pixels from left-to-right, and top-to-bottom. The second pass does the actual boxfilter by taking the values from the four corners of the box from the SAT and subtracting them to get the sum of just that box.
This means that the filter has O(1) complexity: it has constant speed, regardless of kernel size. Building the table is always the same, and the lookup always needs only 4 samples from the table, regardless of size (granted, because of caching, larger kernels tend to be a tad slower, since the samples have poorer cache coherency).
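A quick CPU sketch of both passes (the function names are mine, not from the engine):

```python
def build_sat(img):
    """Summed Area Table: sat[y][x] = sum of img over all rows <= y, cols <= x."""
    h, w = len(img), len(img[0])
    sat = [[0.0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]                                 # left-to-right
            sat[y][x] = row_sum + (sat[y - 1][x] if y else 0.0)  # top-to-bottom
    return sat

def box_sum_sat(sat, x0, y0, x1, y1):
    """Sum over the inclusive box [x0..x1] x [y0..y1]: four corner lookups."""
    total = sat[y1][x1]
    if x0:
        total -= sat[y1][x0 - 1]
    if y0:
        total -= sat[y0 - 1][x1]
    if x0 and y0:
        total += sat[y0 - 1][x0 - 1]
    return total
```

Divide the result by the box area and you have the filtered pixel, at the same cost for any kernel size.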
This filter can be seen in action in the Croissant 9 demo in a few places, such as the opening scene.

The reason why I could never implement this algorithm on the GPU is that building the table requires you to sum up neighbouring pixels, and you cannot read back from a rendertarget while you are rendering to it. Only in a following pass.
However, Compute Shaders do not have this restriction. They support Unordered Access buffers. These buffers can be read from and written to in any order. This gives you the same freedom as a regular CPU in terms of memory access. This is nothing short of a revolutionary feature of modern GPUs.

So, I decided to put my D3D11 codebase to good use for a change, and I tried to build a Compute Shader that calculates the SAT. I did the actual sampling of the table in a regular pixelshader. On downlevel hardware (Direct3D 10 hardware can also support Compute Shaders: CS4.0 or CS4.1 rather than CS5.0), textures cannot be used as unordered access views (the Compute Shader equivalent of a rendertarget, so to say), so you cannot output directly to a texture. Pixel shaders can read unordered access buffers though, so with an extra pixelshader pass you can copy your compute shader output buffer to a 'real' texture. I figured I could kill two birds with one stone, and do the table-based filtering in that pixelshader, while rendering to a texture.
And, after a bit of a struggle, I managed to get it to work.

I have to say though, the performance is not as good as I hoped it would be. Since it is ~constant time, I get ~700 fps regardless of kernel size... but that is also its 'top speed'. The above-mentioned multipass algorithm dating from the DX7 era is much faster for small kernels: the scene renders at about 5000 fps with no filter applied, and doesn't drop to ~800 fps until I pump the filter up to about 512x512. By then the scene is already blurred beyond recognition... so sadly the point where the SAT algorithm starts outperforming the multipass algorithm is past the area of interest.

Anyway, it was fun writing some REAL D3D11 code, and it was educational to get the hang of how Compute Shaders fit into the D3D11 model, and how they cooperate with textures and pixelshaders. (I have dabbled with Cuda a bit, when I had my GeForce 8800, but Compute Shaders are very different in some ways: you still write them as regular HLSL, and compile and execute them much like other shaders in D3D, whereas Cuda was more similar to C/C++ and blended into your application code almost seamlessly.) After all, Compute is one of the biggest new features of D3D11.
I may look into alternative algorithms in the future, since SAT is not all that ideal for CS yet. But first I will extend my multipass algorithm with the accumulator rendertarget that I mentioned, so that I can support filter kernels of arbitrary size.
Posted on 2011-02-28 16:05:25 by Scali
Okay, after a lot of debugging, I have now finished my other separable blur filter as well. It can now handle any filter size. It's slightly less efficient than before, but still very acceptable, even for kernel sizes that are exact powers of two.
Funny thing is that the performance is now a bit unintuitive. For example, a filter kernel of (63,63) is slower than (64,64). That's because (63,63) is a worst case, factoring into many powers of two that all have to be accumulated, where (64,64) just needs to accumulate one result.
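This is easy to see from the binary representation of the kernel size: the number of terms to accumulate per axis is simply the number of set bits. A one-liner to illustrate (my own helper, just for the argument):

```python
def accum_terms(size):
    # number of power-of-two terms the kernel size factors into
    return bin(size).count('1')
```

So 63 (binary 111111) needs six terms per axis, while 64 (binary 1000000) needs just one.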

As a side-effect, this makes my Compute Shader look better in comparison. I've optimized it a bit further, and it will now perform somewhere in the range of 700-780 fps for the same scene.
For small kernel sizes, the power-of-two filter is still faster, but it drops below 700 before you reach (64,64) kernel size... Which I would say is still a valid kernel size for practical uses (especially at higher resolutions).
Another thing is that the Compute Shader delivers better quality. The quality of the power-of-two filter degrades at every pass, especially when you need to accumulate many terms (I could switch to floating point rendertargets, but they'd require a lot more memory and bandwidth, so they'd probably be a lot slower).

And what is also cute: the table allows you to filter with any kernel size. Not only can you calculate the SAT once, and then produce a number of textures with different levels of filtering... You could also change the filter size per-pixel. This is an interesting feature for when you want to simulate certain camera/lens effects, for example (soft focus/depth-of-field). You could make the amount of filtering per pixel dependent on the distance of the pixel to the camera. Or for more abstract effects: you could have some sort of 2d-animation that drives the filtering. A bit like an alpha-mask, just applied to the filter.
Posted on 2011-03-02 17:17:09 by Scali
VJ'ing makes you think of graphics in different ways than regular games or demos. One such thing is the concept of 'videowalling': using a large amount of projectors (or flatpanels) to create a huge display surface.

Modern videocards may be able to drive two or three monitors, but that is not enough in this case. And even if it were possible to connect enough monitors to a single videocard, it may not be feasible to use a single videocard because of memory and performance limitations, as a result of the extremely high resolution being used.

So at the very least I would want to be able to support multiple videocards. Even better would be to support multiple PCs in a network.
We are going to be implementing both. Multiple videocards will be the first. We will add a networking layer later.

A problem with using multiple videocards or PCs is that you want the entire display area to be covered by a single camera view. When you cut this camera view up into smaller sections, the perspective of the single large camera needs to be preserved, otherwise the sections will not fit together properly.
A naive approach would be to have each renderer render the entire view to a large rendertarget, and then cut out only the part it needs for its particular section of the display.

I've worked out the math for a projection matrix that does the work of cutting out a subsection. By using the proper skewing factors in the projection matrix in homogeneous space, you can effectively do translation after the division by W... so after perspective is applied. Hence, the perspective is the same as the non-translated camera, but the world is shifted inside the camera. Combine this shifting with a scale factor, and you can render any subregion from the camera that you want.
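As a sketch of the idea (my own notation, not Scali's actual matrix code; it assumes a column-vector convention and remaps an NDC rectangle back to the full [-1,1] viewport): the scale goes on the diagonal, and the translation goes into the W column, so it only takes effect after the divide by W:

```python
def subregion_matrix(x0, x1, y0, y1):
    """Remaps the NDC rectangle [x0,x1] x [y0,y1] to the full [-1,1] viewport.
    Multiply on the left of the projection matrix (column-vector convention).
    The tx/ty 'skew' terms sit in the W column, so after the divide:
    x_ndc' = sx * x_ndc + tx, i.e. translation after perspective."""
    sx, sy = 2.0 / (x1 - x0), 2.0 / (y1 - y0)
    tx, ty = -(x0 + x1) / (x1 - x0), -(y0 + y1) / (y1 - y0)
    return [[sx, 0.0, 0.0, tx],
            [0.0, sy, 0.0, ty],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

def apply(m, v):
    # 4x4 matrix times a 4-component column vector
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]
```

For example, remapping the right half of the screen ([0,1] in NDC x) doubles x and shifts it by one full W, so a point that was at NDC x = 0.5 ends up in the centre of the new viewport, with its perspective untouched.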

This way each videocard only has to render the part of the scene that it actually displays, without needing a very large rendertarget, so you get maximum efficiency.

This idea also works the other way around: You can also scale your camera down, and move it anywhere on screen while maintaining the original perspective. This can be useful if you want to have multiple viewports or some kind of picture-in-picture effect. Instead of rendering each view to its own rendertarget, and then copying them all to the final screen, you can render them all directly to the larger target, in the right place.

Another advantage of this approach is that you don't require SLI or CrossFire when you want to use multiple videocards in a single system. The videocards don't even need to be the same brand or type. This can cut hardware costs significantly, and make the setup far more flexible. It may even be slightly more efficient than SLI/CrossFire, as there doesn't have to be any communication or sharing of resources between the videocards.
Posted on 2011-04-22 05:56:47 by Scali
Hi, Scali.

Maybe you will find softTH interesting. Closed source, though.

I've had some experience with all this but now I have a single 295, which supports surround2D, and this thing delivers.
Posted on 2011-04-23 19:44:49 by HeLLoWorld