Right, so for this VJ-tool we're building, one of the things I had to do was stream video to a texture. This requires DirectShow. Since I knew nothing about DirectShow, I figured I'd google for an example first, then go from there.
I found this example on the Ogre wiki: http://www.ogre3d.org/tikiwiki/DirectShow+video+in+ogre+texture&structure=Cookbook

So, I modified the code a bit to make it work with Direct3D. Initially I got an update rate of about 380 fps out of the test video I was using.
Technically that is 'good enough', but I'm not the type of person who settles for just 'good enough' :)
While reading through the code and modifying it to work with D3D, I noticed this rather peculiar loop:
        // go set all bits...
        for (i=0; i<(dsdata->videoWidth*dsdata->videoHeight*3); i+=3){
            idx=(x*4)+y*pixelBox.rowPitch*4;

            // paint
            pDest[idx]   = bmpTmp[i];   //b
            pDest[idx+1] = bmpTmp[i+1]; //g
            pDest[idx+2] = bmpTmp[i+2]; //r
            pDest[idx+3] = 255;         //a

            if (shouldBeMirrored){
                x--;
                if (x<0){
                    x=dsdata->videoWidth-1;
                    y--; if (y<0) y=0;
                }
            }else{
                x++;
                if (x>=dsdata->videoWidth){
                    x=0;
                    y--; if (y<0) y=0;
                }
            }
        }


Since this is the main loop, which copies the video frame from the DirectShow sample grabber to the texture, this bit of code is going to have a major impact on performance.
To the experienced eye, I suppose it is quite clear what is wrong with it right away...
There are a couple of nested if-statements in the inner loop!
Aside from that, I find the choice of structuring the whole thing as a single loop, rather than two nested loops (one over x, one over y), a bit strange as well.
If you make it two nested loops, you can already remove quite a few of these if-statements from the inner loop, and have them evaluated only once per scanline rather than once per pixel. Not to mention that the code will be easier to read and understand.
And if you just duplicate the inner loop into a mirrored version and a regular one, you can remove the if-statements from the inner loop altogether, as sketched below.
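A minimal sketch of that restructuring (re-using the variable names from the snippet above: bmpTmp points at the 24-bit source frame, pDest is an unsigned char pointer into the locked texture, rows stored bottom-up) could look something like this:

        const unsigned char *src = (const unsigned char*)bmpTmp;

        for (int y = dsdata->videoHeight - 1; y >= 0; y--)
        {
            unsigned char *dst = pDest + y * pixelBox.rowPitch * 4;

            if (shouldBeMirrored)
            {
                // mirrored scanline: write right-to-left
                dst += (dsdata->videoWidth - 1) * 4;
                for (int x = 0; x < dsdata->videoWidth; x++, src += 3, dst -= 4)
                {
                    dst[0] = src[0]; // b
                    dst[1] = src[1]; // g
                    dst[2] = src[2]; // r
                    dst[3] = 255;    // a
                }
            }
            else
            {
                // regular scanline: write left-to-right
                for (int x = 0; x < dsdata->videoWidth; x++, src += 3, dst += 4)
                {
                    dst[0] = src[0]; // b
                    dst[1] = src[1]; // g
                    dst[2] = src[2]; // r
                    dst[3] = 255;    // a
                }
            }
        }

The if on shouldBeMirrored is now evaluated once per scanline instead of once per pixel, and the two inner loops contain nothing but the copy itself.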

Once I did that, the performance jumped up from 380 fps to about 460 fps. Which is good, but I was not satisfied yet.
Closer inspection of the entire routine showed this general structure:
        char *pBuffer = new char[cbBuffer];
        if (!pBuffer)
        {
            // out of memory!
            throw(" Out of memory or empty buffer");
        }
        hr = dsdata->pGrabber->GetCurrentBuffer(&cbBuffer, (long*)pBuffer);

...

        // bye
        delete[] pBuffer;


Right, another death trap right there. Memory allocation is expensive!
If you only need a temporary buffer within the scope of a function, you should generally use the stack. The alloca() function (non-standard, available as _alloca() in MSVC) allocates memory on the stack, which is automatically freed when the function returns.
Simply replacing the new statement with alloca() and removing the delete[] made the framerate jump from 460 to 520 fps.
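In code, the replacement is basically a one-liner (a sketch; _alloca() is the MSVC spelling, declared in malloc.h):

        // allocate the temporary frame buffer on the stack instead of the heap;
        // it is released automatically when the function returns
        char *pBuffer = (char*)_alloca(cbBuffer);

        hr = dsdata->pGrabber->GetCurrentBuffer(&cbBuffer, (long*)pBuffer);

        // ... copy pBuffer into the texture ...
        // no delete[] required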

This is nice... but we can probably do better. A single frame requires a pretty large buffer, which means the stack will probably have to be grown, guard pages need to be touched, etc. alloca() takes care of this for you, but it might cost a bit of performance.
So I figured I'd try to just allocate the buffer once and store the size. If a larger size is ever required, I allocate a new buffer; otherwise I just re-use the existing buffer and skip any allocation and deallocation. The memory stays allocated the whole time, but that's not really a problem, since I need to update the texture for every frame anyway (see the sketch below).
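A minimal sketch of that grow-only buffer (the member names are made up for the example, not taken from the actual code):

        // persistent state, kept alongside the DirectShow data:
        //   char *m_pBuffer;     // frame buffer, reused across frames
        //   long  m_bufferSize;  // current capacity in bytes

        long cbBuffer = 0;
        hr = dsdata->pGrabber->GetCurrentBuffer(&cbBuffer, NULL); // query required size

        if (cbBuffer > m_bufferSize)
        {
            // only (re)allocate when the required size grows
            delete[] m_pBuffer;
            m_pBuffer = new char[cbBuffer];
            m_bufferSize = cbBuffer;
        }

        hr = dsdata->pGrabber->GetCurrentBuffer(&cbBuffer, (long*)m_pBuffer);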

And there we have it: the code is now touching 600 fps. An improvement of almost 58%. And that is the total framerate, including the decoding of the video and rendering the texture to screen. 58% gained by just copying the frames more efficiently.

And it might be possible to do even more optimizations. For example, the video frames are currently decoded to a 24-bit pixelformat, while the texture is 32-bit. If I have the video frames decoded to 32-bit directly, I don't need to reorder any bytes, and can just burst-copy one scanline at a time. The decoding itself may also be more efficient, since 32-bit pixels are nicely aligned on dword, where 24-bit pixels are a mess.
Another option is to apply a slightly different frame grabbing technique. Currently you get a copy of the frame that the sample grabber has already buffered. I believe it is possible to get a pointer directly to the sample grabber's buffer. In this case that would work out fine, since I only need to read the data; I don't need to modify it or anything.
In theory you could even have the video frame copied to the texture directly, but the problem is the pitch. The scanlines of a texture may have some padding bytes to improve texture fetching efficiency, so the pitch may not be the same as the width of a scanline. If they are not the same, you will need to do an extra copy to reorder the scanlines. But if they are the same, you could grab the frame directly to the texture. This could be implemented as a special case (see the sketch below). Or perhaps it is possible to implement a custom sample grabber which can handle pitch.
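To illustrate the scanline copy and the pitch special case, a rough sketch (assuming the frames are already decoded to 32-bit, and ignoring the bottom-up row order of RGB frames for brevity; the names are made up for the example):

        #include <string.h>

        // copy a 32-bit frame into a mapped texture
        void CopyFrameToTexture(unsigned char *pDest, int pitch,
                                const unsigned char *pSrc, int width, int height)
        {
            const int bytesPerLine = width * 4;

            if (pitch == bytesPerLine)
            {
                // special case: pitch equals the scanline width, one big burst copy
                memcpy(pDest, pSrc, (size_t)bytesPerLine * height);
            }
            else
            {
                // padded scanlines: copy one scanline at a time, skipping the padding
                for (int y = 0; y < height; y++)
                    memcpy(pDest + (size_t)y * pitch, pSrc + (size_t)y * bytesPerLine, bytesPerLine);
            }
        }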
Posted on 2011-01-13 02:18:06 by Scali
I have dug into this code a little deeper...
I can report that changing from RGB24 to RGB32 is actually slightly slower, not faster. At least, in this case (might be different for other movie formats).

Another thing I have improved: I have implemented the ISampleGrabberCB interface. This gives me a callback whenever a new frame is decoded.
The Ogre example copies the last frame to the texture every time updateMovieTexture() is called; it does not check whether the frame has actually changed.
Although I got 600 fps out of the movie, the actual movie only runs at about 24 fps, so most of the updates are redundant.

I have now used the callback to signal when a new buffer arrives, and I update the texture only once per new buffer. A side-effect of the callback is that you get a pointer to the buffer directly, which means I no longer have to allocate my own temporary buffer and use GetCurrentBuffer() to get a copy.
Since I get a pointer to the buffer, I no longer need the sample grabber to buffer the frames for me, so I can call SetBufferSamples(FALSE), saving some extra memory and copy operations. Roughly, the callback object looks like the sketch below.
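A minimal sketch of such a callback (error handling and locking omitted; the sample grabber interfaces live in qedit.h in older SDKs):

        #include <dshow.h>
        #include <qedit.h>  // ISampleGrabber, ISampleGrabberCB

        class FrameGrabberCB : public ISampleGrabberCB
        {
        public:
            volatile bool newFrame;

            FrameGrabberCB() : newFrame(false) {}

            // IUnknown: the object is owned elsewhere, so reference counting is a no-op
            STDMETHODIMP_(ULONG) AddRef()  { return 1; }
            STDMETHODIMP_(ULONG) Release() { return 1; }
            STDMETHODIMP QueryInterface(REFIID riid, void **ppv)
            {
                if (riid == IID_IUnknown || riid == IID_ISampleGrabberCB)
                {
                    *ppv = static_cast<ISampleGrabberCB*>(this);
                    return S_OK;
                }
                *ppv = NULL;
                return E_NOINTERFACE;
            }

            // not used; we register for the buffer callback instead
            STDMETHODIMP SampleCB(double SampleTime, IMediaSample *pSample) { return S_OK; }

            // called on DirectShow's streaming thread for every decoded frame
            STDMETHODIMP BufferCB(double SampleTime, BYTE *pBuffer, long BufferLen)
            {
                // copy or process pBuffer here (it is only valid during this call),
                // then flag the renderer that a fresh frame is available
                newFrame = true;
                return S_OK;
            }
        };

        // hooking it up (pGrabber is the filter's ISampleGrabber interface):
        //   FrameGrabberCB cb;
        //   pGrabber->SetBufferSamples(FALSE);
        //   pGrabber->SetCallback(&cb, 1);  // 1 = BufferCB, 0 = SampleCB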

With this new and improved code, things work very nicely, with DirectShow decoding the movie on its own core, and signaling me whenever a new buffer arrives. This allows my renderer to continue rendering at full blast, yielding thousands of frames per second.
This solution should now scale very well to multiple videos/textures, and also to higher resolutions. That should allow for some pretty nifty realtime movie blending and other trickery in HD-resolution.
Posted on 2011-01-15 13:32:25 by Scali
Hi, I'm trying to implement the ISampleGrabberCB interface in a program (using DirectShow), but the callback function is never called by the grabber.
Could you show me how to do it?

Thanks
Posted on 2011-04-12 19:12:57 by The Morlok
If the callback is never called, then your graph is probably not correct (the video may not be decoding properly, or the sample grabber may not be in the right place).
I would suggest checking with GraphEdit.
A really handy trick is to use the Running Object Table to register your graph, so you can debug and edit it 'live' with GraphEdit and see where the problem is. The usual helper for that is sketched below.
For more info, see here: http://msdn.microsoft.com/en-us/library/dd390650(v=vs.85).aspx
Or here: http://stackoverflow.com/questions/27832/how-can-i-reverse-engineer-a-directshow-graph
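For reference, a sketch of that ROT registration helper (essentially the standard AddToRot() from the DirectShow documentation, with error handling trimmed):

        #include <dshow.h>
        #include <strsafe.h>

        // register the filter graph in the Running Object Table so that
        // GraphEdit (File -> Connect to Remote Graph) can attach to it
        HRESULT AddToRot(IUnknown *pUnkGraph, DWORD *pdwRegister)
        {
            IRunningObjectTable *pROT = NULL;
            HRESULT hr = GetRunningObjectTable(0, &pROT);
            if (FAILED(hr))
                return hr;

            // the moniker name must have this exact form for GraphEdit to find it
            WCHAR wsz[256];
            StringCchPrintfW(wsz, 256, L"FilterGraph %08x pid %08x",
                             (DWORD)(DWORD_PTR)pUnkGraph, GetCurrentProcessId());

            IMoniker *pMoniker = NULL;
            hr = CreateItemMoniker(L"!", wsz, &pMoniker);
            if (SUCCEEDED(hr))
            {
                hr = pROT->Register(ROTFLAGS_REGISTRATIONKEEPSALIVE, pUnkGraph,
                                    pMoniker, pdwRegister);
                pMoniker->Release();
            }
            pROT->Release();
            return hr;
        }

        // when tearing the graph down, revoke the registration again:
        //   IRunningObjectTable *pROT;
        //   if (SUCCEEDED(GetRunningObjectTable(0, &pROT))) {
        //       pROT->Revoke(dwRegister);
        //       pROT->Release();
        //   }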
Posted on 2011-04-13 15:24:22 by Scali