My program uses BitBlt extensively, and I'm wondering whether replacing it with manually optimized memory moves of the image data would be faster. I have a GeForce 2, and since getting it I've seen an overall speedup of about 100% in all programs. I know that DDraw has a "Blt" function, but I'm not sure whether the desktop window uses it.
The program I'm making is designed for people with the best video cards, so hardware acceleration is welcome.
Thanks in advance :alright:
Posted on 2002-10-24 09:53:41 by Ultrano
BitBlt is a raster capability of the graphics card. Some devices don't have it !! (bad, huh)
BitBlt has historically used DMA to transfer block images quickly and asynchronously. The function will (I believe) switch to a non-DMA method if DMA is not available or if the card doesn't report RC_BITBLT.
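(You can check whether a device reports it with GetDeviceCaps; a minimal sketch, the label is made up:)

invoke GetDeviceCaps, hdc, RASTERCAPS   ; query the device's raster capability bits
test eax, RC_BITBLT                     ; set if the device can transfer bitmaps
jz no_blt_caps                          ; made-up label: handle devices without it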

In answer to your question, nothing is faster than DMA.
Continue to use BitBlt; no register-based copy loop will ever touch it.
Posted on 2002-10-27 10:13:11 by Homer
Once again I have to disagree :(

GDI BitBlt has to do a lot of parameter checking (do not forget that it is the same function used for printers), and it also has to take into account device-independent bitmaps and all kinds of resolutions and pixel formats...

So GDI BitBlt is very slow... on newer operating systems like Win2k and XP it tries to use 2D hardware acceleration (if available), at least for alpha blending... However, 2D is badly treated in video board drivers...

Our tests at HE development show that we can beat GDI BitBlt by 4x up to 10x with a simple, non-optimized register copy loop (not even using MMX or SSE). However, in doing so you lose portability of your data and have to take care of every pixel format / data format / resolution available (this is not a problem in the HE game, since we force the resolution and pixel format used).
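Just to show what I mean by a "simple register copy loop", here is a rough sketch (the variable names and setup are made up for illustration; it assumes the width in bytes is a multiple of 4 and the pitches are declared DWORDs):

; esi = source pixels, edi = destination pixels
; ebx = number of rows to copy
copy_row:
    mov ecx, dwWidthBytes    ; bytes per scanline
    shr ecx, 2               ; copy a DWORD at a time
    push esi
    push edi
    rep movsd
    pop edi
    pop esi
    add esi, dwSrcPitch      ; step to the next scanline of each surface
    add edi, dwDstPitch
    dec ebx
    jnz copy_row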

DirectDraw speed is another matter: we cannot beat a video-to-video BitBlt, not even in our dreams.

To give you an example: it takes us 4,000 to 10,000 microseconds to do a fullscreen software BitBlt from system to video memory, and only 40 microseconds to do the same fullscreen BitBlt on the video board if source and destination are both in video memory ... :)
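(For reference, one way to get microsecond timings like these is QueryPerformanceCounter; a sketch only, with t0/t1/freq being made-up LARGE_INTEGER variables:)

invoke QueryPerformanceFrequency, addr freq
invoke QueryPerformanceCounter, addr t0
; ... the BitBlt under test goes here ...
invoke QueryPerformanceCounter, addr t1
; elapsed microseconds = (t1 - t0) * 1000000 / freq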

Besides, do not even think of reading pixels from video memory surfaces; it is damn slow, around 100x slower.

However, we can even beat DX BitBlt most of the time on system-to-video transfers (especially with AGP boards), but only by a close call...

We usually win by about 10%, but sometimes we win by 100% or more; then I suspect it is a bad video board driver implementation. Simple unrolled register loops will beat them; again, no hard optimizations are needed, but they could be used to gain even more.

DMA is slow; it has not been used in video boards for decades (eh, exaggerating a little here). The only advantage of DMA is that in theory it leaves the CPU time to do other tasks; in practice this depends on how much data and code you have in the CPU cache, and on how long the DMA transfer keeps the system bus inoperative...

It is even worse on multi-CPU systems... so my advice: leave DMA to the FDD and HDD subsystems... video is too fast for that.

Think about it a little: current CPUs run at 1.3...2.2 GHz while GPUs are still at 300-400 MHz; even with a 4x bus size they just cannot cope with the CPU's computational power... maybe this will change again in the future as GPUs advance...

There are many modern programming myths that are utterly wrong and based on hypocrisy; since I finished some running contracts, I hope I will have more time to write the articles to destroy such misconceptions :P

If I have time I will even consider making some demo programs to show these interesting facts of life :)
Posted on 2002-10-27 13:18:57 by BogdanOntanu
OK, so I'll qualify those remarks, as I should have.

If you are blitting from video memory to video memory, chances are this will be done by the video card in hardware, using direct memory access, without touching the CPU at all (as Bogdan pointed out).
If you are blitting from system memory to system memory, chances are this will be done by the motherboard DMA normally used by the I/O hardware for things like async disk accesses.

But if you are blitting between video and system memory, there is no path to perform DMA from video memory to system memory.
This is a hardware issue which needs addressing in future video cards.

And yes, if you wish to transfer large chunks of data between video and system memory quickly: I use MMX for this.
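Something along these lines (a sketch only; it assumes both pointers are 8-byte aligned and the byte count is a multiple of 32, and the register usage is made up):

; esi = source, edi = destination, ecx = byte count / 32
mmx_copy:
    movq mm0, [esi]          ; load 32 bytes through the MMX registers
    movq mm1, [esi+8]
    movq mm2, [esi+16]
    movq mm3, [esi+24]
    movq [edi], mm0          ; store them to the destination
    movq [edi+8], mm1
    movq [edi+16], mm2
    movq [edi+24], mm3
    add esi, 32
    add edi, 32
    dec ecx
    jnz mmx_copy
    emms                     ; restore the FPU state when done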

Hope that clears it up.
(Oh, and since when has DMA been slow? It runs at the system clock speed !!! )
Posted on 2002-10-27 22:33:53 by Homer
Thanks, I really think this is the solution :) - to copy manually.
DCs are checked for clipping regions too, and this, I suppose, costs a lot.
And since the width of the song window is fixed, there will be ways to optimize it.
While searching for a solution, I found a site on optimizing code that explains how caching works: memory is read a 32-byte cache line at a time, and behind that fill there is address translation and a lot of stuff that can trick the CPU. It takes about 10 cycles, as told, to read those 32 bytes, and after you've read one of them, you can read the others almost for free ( I suppose ).


http://www.iseran.com/Win32/CodeForSpeed/
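(So I guess one trick that could work: touch one dword per 32-byte line first, so the lines are already in the cache when the real copy runs. A sketch of what I mean, nothing tested:)

; esi = source, ecx = byte count (multiple of 32)
    push esi
    push ecx
    shr ecx, 5               ; number of 32-byte cache lines
warm_loop:
    mov eax, [esi]           ; reading one dword pulls in the whole line
    add esi, 32
    dec ecx
    jnz warm_loop
    pop ecx
    pop esi
    ; ... now do the real copy; the source is already cached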
Posted on 2002-10-29 07:17:11 by Ultrano
huh,

Very curious!!!

I made speed tests on the previous version of Dreamer. Target instrument: CB101, which is 550x235 pixels, tested at 32-bit screen depth on my GeForce2 MX200. There are two memory DCs, MEMDC1 and MEMDC2: the first contains the bitmap from the resource; the second is the same size and contains an (already created) empty bitmap. MEMDC1 is read from, and MEMDC2 is written into, then BitBlt-ted onto the screen:

invoke BitBlt,MEMDC2,0,0,MY_WID,MY_HEI,MEMDC1,0,0,SRCCOPY

Then there are many calculations for each element (over 600 elements in CB101), and between the calculations there are several BitBlts (one for each knob, one for each LCD cell).
Overall result: only 350,000 CPU cycles !!! The data moved by the above call alone is 520 kB, so it is either as fast as movmd, or faster! There is one final BitBlt from MEMDC2 to the window of the instrument. Including this Blt costs nothing !!! Only several hundred cycles !!!
But this is not the most curious part !
I copied the first Blt call (the one pasted above) and pasted it 4 times in a row, and added a "version check" to the debug message, so that I don't accidentally test an "old" build of the instrument and get measuring errors. Having added this row of memory-to-memory BitBlts, I measured the overall CPU consumption of CB101's painting function: ... 355,000 cycles !! I decided that the GDI functions had detected the row of repeated blits, so I made a row of 10 blits, each different from the others.
Only 365,000 cycles !!

I made the tests at >8 fps while the instrument was playing (7 fps automatic "invoke InvalidateRect,hWnd,0,0"), dragging the instrument's window quickly in and out of view. Maximum cycles measured: 420,000.
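(In case it matters: cycle counts like these can be taken with RDTSC around the paint function; a minimal sketch, with invented variable names:)

    rdtsc
    mov dwStartLo, eax       ; save the starting timestamp
    mov dwStartHi, edx
    ; ... the painting function being measured ...
    rdtsc
    sub eax, dwStartLo       ; 64-bit subtract: edx:eax = elapsed cycles
    sbb edx, dwStartHi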

Computer characteristics:
Win98SE, 451 MHz K6-2, 64 MB @ 66 MHz, Aladdin5 mainboard (partially incompatible with the GeForce), GeForce2 MX200 (set to 1x AGP, with most hardware acceleration features disabled, otherwise the computer won't work), YMF724 DirectSound-accelerated sound card

CPU consumption of Dreamer:
4% without any song open (DirectSound working non-stop)
6% with a song playing with a CB101 and an HD1, at a 7 fps refresh rate


After being surprised by BitBlt, I ran the same tests on the new version of DR (unreleased, much different from the current DR). Here I move an 800x102 image (320 kB) from MEMDC1 to MEMDC2, then from MEMDC2 to the window, with no computation in between. It all takes 10,000 cycles ! Framerate: the window is redrawn (not theoretically, but measured in practice) 20 times a second ! This takes less than 1% CPU, measured with the MS System Monitor for Win98SE.

I feel sorry that today I made a bunch of functions, an "optimized" BitBlt and such, which are in fact worthless, because BitBlt is so fast.

I had once tested DR with my previous video card, an S3 Trio3D 4MB. Moving windows was hell, glitches came out of the song, and the framerate fell below 2 fps. All these tests shout that GDI BitBlt is somehow accelerated on my new card.

When I run MS Process Viewer 95 (provided with the Visual C++ introductory edition), there's always a DDHELP.EXE process. Could this be "DirectDraw Help" ?? Is the desktop window accelerated in any way? I tested memory consumption when I create a 600x125-pixel window: exactly 300 kB is allocated (as big as the bitmap for the window DC), but I cannot tell whether it is RAM or VRAM; they're mixed together.

byez
Posted on 2002-10-29 10:35:50 by Ultrano
More results from the GDI Lab at home :) :
I have two 800x800 windows and one global workspace bitmap, which I work on and then blit to the screen. 86 times a second I do the following operations twice (a rough sketch follows the list):
1) fill the whole workspace with black
2) copy backbuf1 to backbuf2, then backbuf2 to the workspace
3) 10 times: copy backbuf3 to backbuf3, then backbuf3 to the workspace
4) blit the workspace onto the window
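In GDI terms, one pass looks roughly like this (the handle names are made up; PatBlt does the fill, BitBlt the copies):

invoke PatBlt, hdcWork, 0, 0, 800, 800, BLACKNESS                ; 1) fill workspace with black
invoke BitBlt, hdcBack2, 0, 0, 800, 800, hdcBack1, 0, 0, SRCCOPY ; 2) backbuf1 -> backbuf2
invoke BitBlt, hdcWork, 0, 0, 800, 800, hdcBack2, 0, 0, SRCCOPY  ;    backbuf2 -> workspace
mov ebx, 10                                                      ; 3) ten times:
rep3:
invoke BitBlt, hdcBack3, 0, 0, 800, 800, hdcBack3, 0, 0, SRCCOPY ;    backbuf3 -> backbuf3
invoke BitBlt, hdcWork, 0, 0, 800, 800, hdcBack3, 0, 0, SRCCOPY  ;    backbuf3 -> workspace
dec ebx
jnz rep3
invoke BitBlt, hdcWin, 0, 0, 800, 800, hdcWork, 0, 0, SRCCOPY    ; 4) workspace -> window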

At this framerate (86), if I do the operations once, there's no problem; CPU usage is below 3%. But if I do them twice, although CPU usage is only 6%, the window message queue gets delayed by ... 4 seconds ! You click on the title bar and move the mouse to drag the window, but it stands still, and only starts moving by itself after those 4 to 7 seconds :) . A pretty funny thing. I'll save the explanations for now.
Posted on 2002-11-24 23:03:12 by Ultrano