I'm currently programming a DirectDraw blitter (who hasn't, eh?) and I'm adding alpha functionality to it (all those blits are in ASM, of course). I knew from the start there was going to be a penalty for working with sprites in system memory, but there is one detail I can't understand at all.

When I'm blitting from system memory to video memory, most video cards (including mine; I've checked the caps it reports) can use DMA to offload the CPU while doing it. However, it doesn't seem to be happening: I've added the flags to Blt, to no avail, even though I take care to write to a different surface while the other one is being used for the blit. Doing this gave no speed gain at all...

Now, is there anyone with experience on this subject who could help me out?

I'm planning on further optimizing the individual asm blit functions with MMX and perhaps SSE, but if the bottleneck remains that stupid sysmem->vmem blit, what good will it be? And I'll try threads if I can't get any consistent result.
Posted on 2004-03-21 18:24:42 by persil
Sometimes you can beat the DX system-to-VRAM blit by 5%, sometimes you cannot; it all depends on whether the video board has some hardware acceleration for this kind of blit and/or whether that acceleration has bugs.

You are right, most video boards do not use any kind of DMA/AGP for system->vram and/or the speed gain is pathetically small. The fact that there are flags there is irrelevant ;) as they are not used.

When a game has lots of sprites the final sys->vram blit will not matter that much overall anymore; but i agree that it generally does. There is nothing you can do about this.

The best shot is an algorithmic optimization like i use in SOLAR OS: some kind of dirty zones.

For example i keep an array of flags for dirty Y lines and only blit those that get dirty from system to video... usually this gives you 2x speed boost... depending on how many sprites/lines do change on your screen and how dynamic the game/application is.
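To make the dirty-line idea concrete, here is a minimal sketch in C rather than ASM. It is not the SOLAR OS code; the names `SCREEN_H`, `dirty_line`, `mark_dirty` and `flush_dirty` are made up for illustration, and a plain `memcpy` stands in for whatever row copy you actually use.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCREEN_H 480  /* hypothetical screen height in scanlines */

/* One flag per scanline: nonzero means the line changed since the last flush. */
static uint8_t dirty_line[SCREEN_H];

/* Mark rows [y0, y1) as dirty after drawing into the system-memory buffer. */
static void mark_dirty(int y0, int y1)
{
    if (y0 < 0) y0 = 0;
    if (y1 > SCREEN_H) y1 = SCREEN_H;
    for (int y = y0; y < y1; ++y)
        dirty_line[y] = 1;
}

/* Copy only the dirty rows from src (system memory) to dst (video memory),
 * clearing each flag as it goes. Returns the number of rows copied. */
static int flush_dirty(uint8_t *dst, int dst_pitch,
                       const uint8_t *src, int src_pitch, int row_bytes)
{
    int copied = 0;
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (int y = 0; y < SCREEN_H; ++y) {
        if (!dirty_line[y])
            continue;
        memcpy(dst + (size_t)y * dst_pitch,
               src + (size_t)y * src_pitch, (size_t)row_bytes);
        dirty_line[y] = 0;
        ++copied;
    }
    return copied;
}
```

Every sprite draw calls `mark_dirty` for the rows it touched, and the end of the frame calls `flush_dirty` once. If nearly every line is dirty every frame, the flag checks are pure overhead.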

You can theoretically improve this by using some dirty rectangles...

I agree that for a fully dynamic game -- aka everything changes all over the screen every frame -- there is little that can be done, and in fact dirty lines or rectangles will actually slow things down.

Changing everything all over the screen at high resolutions is a job that today's CPUs are unable to do... not faster than 20-30FPS that is... and eating most CPU power.

This is why hardware acceleration exists inside today's video boards; the pathetic part is that drivers offer this acceleration ONLY for 3D (mainly 3D games and demos) and skip many needed 2D functions (like alpha-blended blits, but also line and circle drawing, for example)...

such is life...
Posted on 2004-03-22 02:59:39 by BogdanOntanu
yes, i've played with hires software blit under dos and vesa and thats not blazing fast even with todays cpus...
but i came across these demos by realtech (run in pure dos):

"dx project":
http://www.pouet.net/prod.php?which=955

this one is 640x480x8 and runs incredibly smoothly on my P133.
I just CANT figure how they did it.
of course its 256 cols (i think they use dithering!!!!!!) but with what i know (vesa 2.0, draw in system ram, blit to lfb with rep movsd), i CANT make it as fast EVEN IF I BLIT WITH A STILL PIC, even without a 3D engine...

HOW DID THEY DO IT?
maybe they draw in vram and call a pageflip func... but isnt vram slower? (you usually draw more pix than there are on the screen in one frame because of overdraw... so if vram is slower, better use sys ram+copy...)
ofcourse with 32b theres 4 times more data.

but this one, hires, and truecolor, i think, is maybe more amazing:

"countdown":
http://www.pouet.net/prod.php?which=1524

its smooth on my 133...
how did they do it?

what could be done on a multi GHz machine?
Posted on 2004-03-22 06:37:36 by HeLLoWorld
BogdanOntanu :

Well, thanks for the insight, so DMA really does nothing much to improve speed :/ Then, do you or anyone have any idea why I get 2~3X faster for a split second when my program starts?? Everything is drawing on screen, and still it's faster, but just for a second.

And do you have any idea if I could get a speed-up by placing my sysmem->vmem blitter in a thread, running at the same time I'm updating the next frame???

HelloWorld:

Yeah, these demos were kinda cool for the time, and they're still amazing :) But on my laptop with XP, it's hard runnin' a DOS app, eh? :)
Posted on 2004-03-22 07:38:42 by persil

any idea why I get 2~3X faster for a split second when my program starts?? Everything is drawing on screen, and still it's faster, but just for a second.


!!!!!!!?????


And do you have any idea if I could get a speed-up by placing my sysmem->vmem blitter in a thread, running at the same time I'm updating the next frame???


excuse me if i missed something, but as long as your blitter uses cpu instructions to blast to vram, like rep movsd or fpu or simd, it doesnt run in true parallel with your updateframe code, its just windows that cuts your two programs in tiny time slices, so it could only be slower i think...
this multithreading is only useful when you split sequential use of different slow resources into two threads in order to start them together, for example starting a disk read and doing a big long computation before it ends (though the example is maybe not the best since with one single thread the OS would maybe do something similar)...

your only hope would be that the ddraw blit HAD sort of a DMA transfer...

does anyone know how these demos achieve speed on a 133 with an S3 Trio64V+ board?
not talking bout the 3D, but the blit...
Posted on 2004-03-22 09:57:57 by HeLLoWorld
...or have you got 2 CPUs maybe ?
Posted on 2004-03-22 09:58:52 by HeLLoWorld
HeLLoWorld, what about not blitting at all, but drawing directly to the video memory? Not an option if you need some sort of post-processing, but then you might as well move to hardware acceleration.

Yes yes too bad 2d accell sorta sucks and that there's no DMA flip etc etc but the PC architecture sucks, deal with it or do something else - you can't change the world, so live in it.
Posted on 2004-03-22 10:04:10 by f0dder
would it be possible that a32 rep movsd would not be as fast when cs and ds are 16 bits?
(0x66 should be generated?)

ds shouldnt change anything...
cs, i dont see why either, but...

i dont know how rep works internally , but for rep movsd i think its a hardware memcpy that does it...
Posted on 2004-03-22 10:54:23 by HeLLoWorld

would it be possible that a32 rep movsd would not be as fast when cs and ds are 16 bits?
(0x66 should be generated?)

Dunno if it's as fast as in native 32bit mode, but using "rep movsd" in 16bit code did give a speedup on the hardware used back in those days :)


i dont know how rep works internally , but for rep movsd i think its a hardware memcpy that does it...

On some Intel processors there's special hardware to handle rep movsd, on others there isn't.
Posted on 2004-03-22 13:28:30 by f0dder
1) don't blt unless you really REALLY have to
(draw directly to vram)

2) let the hardware handle it, it's faster than your CPU
I have a hard time believing that a Blt from a ddraw sysmem texture to the frontbuffer isn't accelerated on most cards. Oh well, guess I'll have to do my own testing one day.

3) the fastest way to draw something is not to draw it
"Dirty methods" like the dirty scanline stuff bogdan talks about, "Dirty rectangles" when doing windowing systems, etc. If you update more or less whole screen every frame, render directly to the card.

On CPUs that support it, try using a MOVNTQ copyloop instead of mmx, rep movsd, FPU-copy or whatever.

Btw, a trick if you're doing a fullscreen 3D renderer - only update half of the scanlines on every screen update :)
Posted on 2004-03-22 14:06:46 by f0dder
Ok, first thanks for discussing about this...

HelloWorld:

I meant that, for a split second, when I start my test application, the frame rate is higher than it is for the rest of the time, perhaps double or more. But this only lasts for a really short period, and I'm really wondering what is causing this...

Other than that, I'll try optimizing the copying if nothing else matters.

About the thread, I thought that, maybe, since blitting is heavily influenced by memory latency, it could leave the CPU free while it is waiting for memory, or something like that, who knows...

Dirty rectangles... I've thought of that, but the fact is that for some effects, the whole screen is gonna be updated, so to keep frame rates consistent, I'm trying to optimize it as a whole. I know for sure that if I end up implementing the 16 bits version of all my blits, it'll be more acceptable.

Some suggest writing to vram directly. I'd certainly like that, but the whole point is that I'm implementing per-pixel alpha and alpha-blended blits, so I need to read from the target surface, which turns out to be deadly slow when it is vram.
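To show where that read comes from, here is a rough C sketch of a per-pixel alpha blend (the real blits are in ASM, so this is only an illustration). It assumes a 0xAARRGGBB pixel layout with the alpha stored in the source pixel, and the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Blend one 8-bit channel: alpha = 255 gives src, alpha = 0 gives dst. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    /* + 127 rounds to nearest instead of truncating */
    return (uint8_t)((src * alpha + dst * (255 - alpha) + 127) / 255);
}

/* Blend one pixel. Assumed layout: 0xAARRGGBB, per-pixel alpha in src. */
static uint32_t blend_argb(uint32_t src, uint32_t dst)
{
    uint8_t a = (uint8_t)(src >> 24);
    uint32_t r = blend_channel((uint8_t)(src >> 16), (uint8_t)(dst >> 16), a);
    uint32_t g = blend_channel((uint8_t)(src >> 8), (uint8_t)(dst >> 8), a);
    uint32_t b = blend_channel((uint8_t)src, (uint8_t)dst, a);
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}

/* Blend a row in place. Every output pixel needs dst[i] as an input,
 * so the target surface is read once per pixel before it is written. */
static void blend_row(uint32_t *dst, const uint32_t *src, int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; ++i)
        dst[i] = blend_argb(src[i], dst[i]);
}
```

The `dst[i]` on the right-hand side in `blend_row` is the expensive part when `dst` lives in video memory. A plain copy only ever writes to the target.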

-

I thought of something and I don't know if it could be worth the hassle. Could it be possible, and would it yield any speed gain if I would prepare a 2d scene first using blits, but only really doing something at the end, calculating every drawn pixel only once... Hmm... very hard to explain, d'y'all understand what I mean? I mean only writing each pixel only once, avoid overdraw. Is it feasible, at the very least?
Posted on 2004-03-22 18:36:39 by persil

I meant that, for a split second, when I start my test application, the frame rate is higher than it is for the rest of the time, perhaps double or more. But this only lasts for a really short period, and I'm really wondering as what is causing this...

How are you implementing the fps counter? Could be inaccuracies due to, say, small values of GetTickCount-starttime - this will stabilize over time.


About the thread, I thought that, maybe, since blitting is a lot influenced by memory latency, it could leave the cpu free when it is waiting for memory, or something like that, who knows...

Not really - thread scheduling is done on a time-quantum basis by the windows kernel, affected by the Real-Time Clock interrupt. It doesn't have to do with whether there are unused execution units in the CPU. SMP machines (including hyperthreading) might see a small gain, but you have data synchronization issues you need to handle - which might eat up any speed gain you get.


Dirty rectangles... I've thought of that, but the fact is that for some effects, the whole screen is gonna be updated, so to keep frame rates consistent, I'm trying to optimize it as a whole.

Yup. You should only use such dirty techniques where they can actually be beneficial, as bogdan also points out.


Some suggest writing to vram directly, I'll certainly like that, but the whole point is that I'm implementing per-pixel alpha and alpha-blended blits, so I need to read from the target surface, which turns out to be deadly slow when it is vram.

Then you're stuck with a sysmem buffer and blitting. Ddraw surface in system memory, blitting to ddraw backbuffer in video memory. I would suggest timing the ddraw routine on more than a few video cards. And for your own blitting routine, do use MOVNTQ if available.


I mean only writing each pixel only once, avoid overdraw. Is it feasible, at the very least?

Depends on whether you add a lot of additional complexity, I guess. This has a lot to do with analyzing your scenario, and choosing correct algorithms etc. Like z-buffer vs. span buffer when dealing with software 3D engine. Also, again for a 3D engine, determining what actually has to be rendered (BSP/Portals/OCTree, backface culling, polygon clipping, ...)
Posted on 2004-03-22 18:47:08 by f0dder
How are you implementing the fps counter? Could be inaccuracies due to, say, small values of GetTickCount-starttime - this will stabilize over time.


I don't know, but I'm using QueryPerformanceCounter to get a high-resolution result, so...??


Not really - thread scheduling is done on a time-quantum basis by the windows kernel, affected by the Real-Time Clock interrupt. It doesn't have to do with whether there are unused execution units in the CPU. SMP machines (including hyperthreading) might see a small gain, but you have data synchronization issues you need to handle - which might eat up any speed gain you get.


I thought so, but I had hopes anyway... Too bad :( But I'm pretty sure that for a real SMP machine, the gain would be quite real.

Then you're stuck with a sysmem buffer and blitting. Ddraw surface in system memory, blitting to ddraw backbuffer in video memory. I would suggest timing the ddraw routine on more than a few video cards. And for your own blitting routine, do use MOVNTQ if available.


Yeah, I'll try that... But... I'm checking in Intel's manual and... is it right that I can only write to memory with MOVNTQ? Then do I only use MOVQ to read the memory first?


Depends on whether you add a lot of additional complexity, I guess. This has a lot to do with analyzing your scenario, and choosing correct algorithms etc. Like z-buffer vs. span buffer when dealing with software 3D engine. Also, again for a 3D engine, determining what actually has to be rendered (BSP/Portals/OCTree, backface culling, polygon clipping, ...)


Okay, but for a 2D engine, the whole point would be of knowing which sources will be involved for each pixels, or zones perhaps, and only calculate that which is visible. For example, if I blit the background, then blit an opaque sprite over it, the background should be totally ignored for those pixels, and if an area hasn't changed I should also ignore it. That's what I'm thinking about...
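One way to picture the "write each pixel only once" idea, at least for fully opaque sprites, is to composite front to back with a coverage mask. The following C sketch is only an illustration under that assumption (no alpha, axis-aligned rectangles filled with a flat colour); `Rect`, `SURF_W`, `SURF_H` and `compose_front_to_back` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { SURF_W = 8, SURF_H = 4 };   /* tiny hypothetical surface */

typedef struct { int x, y, w, h; uint8_t color; } Rect;

/* Draw opaque rectangles given front-most first. A pixel that an earlier
 * (nearer) rectangle already covered is skipped, so each destination pixel
 * is written at most once. Returns the number of pixel writes. */
static int compose_front_to_back(uint8_t *dst, const Rect *front_first, int count)
{
    uint8_t covered[SURF_W * SURF_H];
    int writes = 0;
    memset(covered, 0, sizeof covered);
    for (int i = 0; i < count; ++i) {
        const Rect *r = &front_first[i];
        assert(r->x >= 0 && r->y >= 0);
        assert(r->x + r->w <= SURF_W && r->y + r->h <= SURF_H);
        for (int y = r->y; y < r->y + r->h; ++y) {
            for (int x = r->x; x < r->x + r->w; ++x) {
                int idx = y * SURF_W + x;
                if (covered[idx])
                    continue;          /* hidden behind something nearer */
                dst[idx] = r->color;
                covered[idx] = 1;
                ++writes;
            }
        }
    }
    return writes;
}
```

With a 3x2 sprite over a full-screen background this does 32 writes instead of 38, but it pays a per-pixel test for it, and alpha-blended sprites would still need the pixels behind them.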
Posted on 2004-03-22 19:15:26 by persil

I don't know, but I'm using QueryPerformanceCounter to get a high-resolution result, so...??

*shrug*. FPS counters tend to stabilize over time, especially the simple "currentframe/elapsedticks" kind that don't have any adjustments.
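For reference, the two usual kinds of counter look roughly like this in C. This is a sketch with made-up names (`fps_cumulative`, `FpsWindow`, `fps_window_tick`); tick values are passed in, so it works the same whether they come from QueryPerformanceCounter or GetTickCount.

```c
#include <assert.h>
#include <stdint.h>

/* Simple "currentframe/elapsedticks" counter: total frames over total time.
 * Early readings move a lot because elapsed_ticks is still small. */
static double fps_cumulative(uint64_t frames, uint64_t elapsed_ticks,
                             uint64_t ticks_per_sec)
{
    assert(ticks_per_sec > 0);
    if (elapsed_ticks == 0)
        return 0.0;
    return (double)frames * (double)ticks_per_sec / (double)elapsed_ticks;
}

/* Windowed counter: reports the rate over the last full second only. */
typedef struct {
    uint64_t window_start;
    uint64_t frames_in_window;
    double   last_fps;
} FpsWindow;

/* Call once per frame with the current tick count. Returns the rate of the
 * most recently completed window (0.0 until the first window completes). */
static double fps_window_tick(FpsWindow *w, uint64_t now, uint64_t ticks_per_sec)
{
    uint64_t elapsed;
    assert(ticks_per_sec > 0);
    w->frames_in_window++;
    elapsed = now - w->window_start;
    if (elapsed >= ticks_per_sec) {
        w->last_fps = (double)w->frames_in_window * (double)ticks_per_sec
                      / (double)elapsed;
        w->window_start = now;
        w->frames_in_window = 0;
    }
    return w->last_fps;
}
```

The windowed version makes a genuine change in speed visible within a second, while the cumulative one blends it into the whole run.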


I thought so, but I had hopes anyway... Too bad :( But I'm pretty sure that for a real SMP machine, the gain would be quite real.

Depends on how much time is wasted in the access synchronization. I guess you'll have to time & test :)


Yeah, I'll try that... But... I'm checking in Intel's manual and... is it right that I can only write to memory with MOVNTQ? Then do I only use MOVQ to read the memory first?

Yup. The purpose of MOVNTQ is to store directly without going through cache, thus minimizing cache pollution. Plus some other fancy stuff ("will not generate a read-for-ownership bus request for the corresponding data line" - ****, intel manuals.) After the blit routine is done, issue an sfence.
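In C with compiler intrinsics, the load/non-temporal-store/sfence pattern looks roughly like the sketch below. It assumes GCC or Clang targeting x86 with SSE enabled, where `_mm_stream_pi` maps to MOVNTQ (some 64-bit compilers emit an equivalent non-temporal store instead); elsewhere it falls back to `memcpy`. `copy_line_nt` and `HAVE_MOVNTQ` are made-up names, and a hand-written ASM loop would be unrolled across the MMX registers rather than moving one quadword at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if (defined(__GNUC__) || defined(__clang__)) && defined(__SSE__)
#include <xmmintrin.h>   /* _mm_stream_pi, _mm_sfence, _mm_empty */
#define HAVE_MOVNTQ 1
#else
#define HAVE_MOVNTQ 0
#endif

/* Copy one scanline of `bytes` bytes using non-temporal 8-byte stores.
 * Assumes bytes is a multiple of 8 and dst is 8-byte aligned. */
static void copy_line_nt(void *dst, const void *src, size_t bytes)
{
    assert(bytes % 8 == 0);
    assert(((uintptr_t)dst & 7) == 0);
#if HAVE_MOVNTQ
    {
        __m64 *d = (__m64 *)dst;
        const uint8_t *s = (const uint8_t *)src;
        for (size_t i = 0; i < bytes / 8; ++i) {
            __m64 q;
            memcpy(&q, s + i * 8, 8);   /* ordinary cached 8-byte load */
            _mm_stream_pi(d + i, q);    /* non-temporal store, bypasses cache */
        }
        _mm_sfence();                   /* make the streamed stores visible */
        _mm_empty();                    /* leave MMX state before any FPU code */
    }
#else
    memcpy(dst, src, bytes);            /* portable fallback, no cache hints */
#endif
}
```

In a real blitter the fence and the `_mm_empty` would go once after the whole blit, as said above, not after every line.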

The P4 has a hardware prefetcher, so manual use of prefetch instructions probably won't help. Dunno if P3 has a hardware prefetcher, so prefetch instructions might be useful here. I guess it would have been nice if the P4 had a prefetch that went only to L1 cache, so you wouldn't trash the L2 cache with the MOVQ loads, but... oh well.

You have 8 MMX registers btw, so use them and unroll the scanline loop. 800x600 and 400x300 are annoying resolutions ;), but most other common horz. rez (320, 640, 1024, 1280, 1600) are evenly divisible by 64. You could do a xrez-divisible-by-64 check at program startup to choose the blitter function to use. Also, do work on scanlines rather than one big width*height*bypp move, since you will be dealing with pitch - but you probably already do this.
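As a rough picture of the startup check and the per-scanline copy, here is a C sketch. All names are hypothetical, and each 64-byte `memcpy` merely stands in for the unrolled block of eight MOVQ loads and eight MOVNTQ stores you would write in ASM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*row_copy_fn)(uint8_t *dst, const uint8_t *src, size_t row_bytes);

/* Fast path: the row is an exact multiple of 64 bytes, so no tail handling. */
static void row_copy_64(uint8_t *dst, const uint8_t *src, size_t row_bytes)
{
    assert(row_bytes % 64 == 0);
    for (size_t off = 0; off < row_bytes; off += 64)
        memcpy(dst + off, src + off, 64);   /* stands in for 8x MOVQ + 8x MOVNTQ */
}

/* General path: full 64-byte blocks first, then whatever is left over. */
static void row_copy_any(uint8_t *dst, const uint8_t *src, size_t row_bytes)
{
    size_t full = row_bytes & ~(size_t)63;
    for (size_t off = 0; off < full; off += 64)
        memcpy(dst + off, src + off, 64);
    memcpy(dst + full, src + full, row_bytes - full);
}

/* Done once at startup, when the mode (and so the row length) is known. */
static row_copy_fn pick_row_copy(size_t row_bytes)
{
    return (row_bytes % 64 == 0) ? row_copy_64 : row_copy_any;
}

/* Copy row by row; each surface advances by its own pitch, which may be
 * larger than row_bytes because of padding. */
static void blit_rows(uint8_t *dst, size_t dst_pitch,
                      const uint8_t *src, size_t src_pitch,
                      size_t row_bytes, size_t rows, row_copy_fn copy)
{
    assert(row_bytes <= dst_pitch && row_bytes <= src_pitch);
    for (size_t y = 0; y < rows; ++y)
        copy(dst + y * dst_pitch, src + y * src_pitch, row_bytes);
}
```

Note that the check is done on bytes per row here, not on pixels, since the copy loop only cares about bytes.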


the background should be totally ignored for those pixels, and if an area hasn't changed I should also ignore it. That's what I'm thinking about...

That *could* turn out to be so much overhead that any gain will be drowned. If you had a mostly static background with moving sprites it would be something else. But something like a RTS game where you scroll around a lot, I don't think this strategy would work out well.
Posted on 2004-03-22 19:38:13 by f0dder
*shrug*. FPS counters tend to stabilize over time, especially the simple "currentframe/elapsedticks" kind that don't have any adjustments.


Perhaps you're right. But in that test, I didn't attempt to stabilize the frame rate, I'm just drawing as fast as I can. And it's not really the counter that tells me it's going faster, it's just me, 'cause it *really* is going faster. If you'd see this, it's pretty strange...


Depends on how much time is wasted in the access synchronization. I guess you'll have to time & test


Well... for this I'll have to pass... 'cause my SMP machine has been dismantled and spread all over since :)


Yup. The purpose of MOVNTQ is to store directly without going through cache, thus minimizing cache pollution. Plus some other fancy stuff ("will not generate a read-for-ownership bus request for the corresponding data line" - ****, intel manuals.) After the blit routine is done, issue an sfence.

The P4 has a hardware prefetcher, so manual use of prefetch instructions probably won't help. Dunno if P3 has a hardware prefetcher, so prefetch instructions might be useful here. I guess it would have been nice if the P4 had a prefetch that went only to L1 cache, so you wouldn't trash the L2 cache with the MOVQ loads, but... oh well.

You have 8 MMX registers btw, so use them and unroll the scanline loop. 800x600 and 400x300 are annoying resolutions , but most other common horz. rez (320, 640, 1024, 1280, 1600) are evenly divisible by 64. You could do a xrez-divisible-by-64 check at program startup to choose the blitter function to use. Also, do work on scanlines rather than one big width*height*bypp move, since you will be dealing with pitch - but you probably already do this.


Good, thanks... My target minimum system requirement is a P3, so I guess this instruction exists on these processors?

What is SFENCE anyway?

About the resolution, well, hmmmm... too bad, eh, my project has been 800x600 based since the beginning :/ But could I use MOVQ/MOVNTQ for the beginning of the line and handle the rest normally, or would the cache stuff not like that in this case? I also read somewhere that MOVAPS is a fast way to do block moves, is it? As for the pitch, well, I'm taking care of it :)

That *could* turn out to be so much overhead that any gain will be drowned. If you had a mostly static background with moving sprites it would be something else. But something like a RTS game where you scroll around a lot, I don't think this strategy would work out well.


Ah, I guess you're right... It would prevent using MMX, because of the need to take care of each pixel individually... Ah, well, perhaps not, but... It would have to be REALLY smart to work right :)


On another front, I've just tried using DDSCAPS_NONLOCALVIDMEM with DDSCAPS_VIDEOMEMORY flags when creating my back buffer instead of using DDSCAPS_SYSTEMMEMORY and... although it's not exactly as much as I had hoped for... it's strange.

Example:

Back buffer is in Sysmem:

Blit BG to Back : 3~4ms
Blit Alpha Blocks : 6ms (there are 400 of them, each 30x30, per-pixel alpha + blended)
Blit Back to Front: ~10ms

Back buffer is in Non-local video memory (what the heck is that anyway, AGP memory perhaps?)

Blit BG to Back : ~4ms
Blit Alpha Blocks: ~60ms
Blit Back to Front: ~4ms

Back buffer is true video memory

Blit BG to Back : ~9ms
Blit Alpha Blocks : ~310ms
Blit Back to Front: 0ms

Hmmm... so this gives an idea of how each one reacts, for my card anyway, but it sure shows how much video memory dislikes being read from...

And... the final cut is really that my buffers have to be in system memory to be usable... And that blitting from sysmem to sysmem is faster than sysmem to vram... and, 10ms for a frame update shows you how much time it consumes just to show what's been drawn... max 100 fps if only drawing is done... approx...
Posted on 2004-03-22 20:00:43 by persil

Good, thanks... My target minimum system requirements is P3, so I guess this instruction exists on these processors?

Afaik, MOVNTQ and friends were introduced with SSE, so they should be. SFENCE should be used after you're done with a bunch of non-temporal stores (like MOVNTQ), to make sure that... umm, everything works out correctly. Have a look at the intel manuals :p

There's no problem in using MOVQ+MOVNTQ for 800x600, except that 800 isn't evenly divisible by 64 (8 MMX regs of 8 bytes each). You can do 12 full iterations of the 64-byte loop though, and then handle the remaining 32 bytes after the loop (still with MOVQ+MOVNTQ).
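The 12-iterations-plus-32-bytes figure is what you get for an 800-byte row, i.e. 800 pixels at one byte per pixel. A small C helper (hypothetical names `RowSplit` and `split_row`) shows the arithmetic for other depths too:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t full_blocks;   /* iterations of the unrolled 64-byte loop */
    size_t tail_bytes;    /* bytes left to handle after the loop */
} RowSplit;

/* Split one scanline into 64-byte blocks (8 MMX registers of 8 bytes each)
 * plus a remainder. */
static RowSplit split_row(size_t width_px, size_t bytes_per_px)
{
    RowSplit s;
    size_t row_bytes;
    assert(bytes_per_px > 0);
    row_bytes = width_px * bytes_per_px;
    s.full_blocks = row_bytes / 64;
    s.tail_bytes = row_bytes % 64;
    return s;
}
```

For 800 pixels this gives 12 blocks and a 32-byte tail at 8 bpp, 25 blocks and no tail at 16 bpp, and 50 blocks and no tail at 32 bpp, so the leftover only shows up on an 8-bit surface.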

MOVAPS is designed for single-precision float values. It will handle 4 floats at a time and thus 128 bits instead of MMX's 64bit, and the XMM registers aren't aliased on the floating-point stack (thus no EMMS required). However, I'm not sure whether there might be some drawbacks to MOVAPS - like if your byte data would be a NaN or similar. Also, data must be aligned on 128bit boundary, or you get a protection fault.

Oh and finally - you might want to try fiddling around with Direct3D, to use all your video hardware acceleration. Not saying that you should change your game to 3D, but rather using 3D acceleration features to accelerate your 2D. Will take a bunch of code to implement, but might be worth the effort?
Posted on 2004-03-22 20:17:29 by f0dder
Afaik, MOVNTQ and friends were introduced with SSE, so they should be. SFENCE should be used after you're done with a bunch of non-temporal stores (like MOVNTQ), to make sure that... umm, everything works out correctly. Have a look at the intel manuals :p

There's no problem in using MOVQ+MOVNTQ for 800x600, except that 800 isn't evenly divisible by 64 (8 MMX regs of 8 bytes each). You can do 12 full iterations of the 64-byte loop though, and then handle the remaining 32 bytes after the loop (still with MOVQ+MOVNTQ).


Thanks a lot for the tips. I'll look into that for sure! I just hope it can improve the whole thing!!!

MOVAPS is designed for single-precision float values. It will handle 4 floats at a time and thus 128 bits instead of MMX's 64bit, and the XMM registers aren't aliased on the floating-point stack (thus no EMMS required). However, I'm not sure whether there might be some drawbacks to MOVAPS - like if your byte data would be a NaN or similar. Also, data must be aligned on 128bit boundary, or you get a protection fault.


Yeah, I find the idea strange myself...

Oh and finally - you might want to try fiddling around with Direct3D, to use all your video hardware acceleration. Not saying that you should change your game to 3D, but rather using 3D acceleration features to accelerate your 2D. Will take a bunch of code to implement, but might be worth the effort?


Yeah, that's what I'm reading everywhere, and what I've kept telling myself since the project started... but hey, you can get the guy out of 2D, but you cannot take 2D out of the guy ;) Thanks for taking time to reply!

And if some other ideas emerge, I'm all ears, thanks.

edit: typo
Posted on 2004-03-22 20:24:55 by persil

Yeah, I find the idea strange myself...

Regular FPU move was used before MMX came along, and it did speed up stuff. Some fiddling was necessary, though. I would look closely at intel docs for MOVAPS and what other people have to say, before blindly implementing it. Might turn out okay though, coupled with MOVNTPS non-temporal store.


Yeah, that's what I'm reading everywhere, and what I've kept telling myself since the project started... but hey, you can get the guy out of 2D, but you cannot take 2D out of the guy ;)

Once you're done playing around with blit optimizing, have a look at it nevertheless - might turn out to give some nice speed gain.

Time to hit the sack...
Posted on 2004-03-22 20:35:18 by f0dder


Regular FPU move was used before MMX came along, and it did speed up stuff. Some fiddling was necessary, though. I would look closely at intel docs for MOVAPS and what other people have to say, before blindly implementing it. Might turn out okay though, coupled with MOVNTPS non-temporal store.


Once you're done playing around with blit optimizing, have a look at it nevertheless - might turn out to give some nice speed gain.

Time to hit the sack...

I'm also looking into improving my blitting speed, would it be a good idea to split up the screen in scanlines to do the copy? MOVNTQ is not supported on my old CPU :(
Posted on 2004-03-22 20:42:50 by x86asm

would it be a good idea to split up the screen in scanlines to do the copy?

You need to do that anyway, because of surface pitch. If MOVNTQ is not supported on your CPU, a pure MMX copy loop is probably the fastest you'll get ("rep movsd" is okay, btw, on intel class processors because of dedicated hardware for this).
Posted on 2004-03-23 03:17:04 by f0dder