Back on topic:

It's cool: if my results are correct, the MOVQ/MOVNTQ blit gives me around a 30% speed improvement when just reading/writing system memory. Next I'll try using it in my sysmem->vmem routine to see what I get.

And I'll also try using the 128-bit XMM registers, to see if that speeds things up even more!
Posted on 2004-03-24 18:26:14 by persil
Hmmm... it really seems the bus or something else is a serious bottleneck, as I'm getting NO improvement at all with this...

I'm stuck with an 8ms screen update... Impossible to get better...

At least I got down from ~3ms to 1~2ms for my sysmem->sysmem blitter.
Posted on 2004-03-24 18:30:55 by persil
Hmmm... no use... I would have to use aligned memory, and I'd have to be sure that both the source and the destination are aligned...!!! I give up :) It's already an improvement at least... but I find it hard to believe there's no way to speed up mem->vram updates :(
Posted on 2004-03-24 18:43:11 by persil
All DX surfaces ought to meet the 128-bit alignment requirement of SSE2 data types. You can set up an SEH frame (the FS:0 stuff) and do the 128-bit blitting. If an exception triggers, check whether the blit caused it, and if so fall back to MOVQ+MOVNTQ.
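
A rough sketch of the same fallback idea, but checking alignment up front instead of catching the exception with SEH (a swapped-in technique, not the FS:0 approach described above). It uses the SSE2 intrinsics `_mm_load_si128` (MOVDQA) and `_mm_stream_si128` (MOVNTDQ) in C rather than hand-written assembly. The names `is_aligned16` and `blit_row` are made up for illustration, and the unaligned path just calls `memcpy` where a MOVQ+MOVNTQ loop would go.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Hypothetical helper: returns 1 if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical helper: copies n bytes from src to dst.
 * Returns 1 if the 128-bit streaming path was used, 0 if it fell back. */
static int blit_row(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if HAVE_SSE2
    if (is_aligned16(dst) && is_aligned16(src)) {
        const __m128i *s = (const __m128i *)src;
        __m128i *d = (__m128i *)dst;
        size_t blocks = n / 16;
        size_t i;
        for (i = 0; i < blocks; ++i) {
            __m128i v = _mm_load_si128(s + i);  /* aligned 128-bit load */
            _mm_stream_si128(d + i, v);         /* non-temporal store, bypasses the cache */
        }
        _mm_sfence();  /* order the streaming stores before later writes */
        /* Copy the tail that does not fill a whole 16-byte block. */
        memcpy((char *)dst + blocks * 16, (const char *)src + blocks * 16, n % 16);
        return 1;
    }
#endif
    /* Stand-in for the MOVQ+MOVNTQ blitter on unaligned data. */
    memcpy(dst, src, n);
    return 0;
}
```

Calling `blit_row` once per scanline keeps the check cheap, and a surface whose pitch is not a multiple of 16 simply takes the fallback on the rows that end up misaligned.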

I would have expected the MOVNTQ stuff to speed things up at least a bit for the sysmem->vidmem blitter, if for no other reason than that you don't get as much cache thrashing. But sure, this is one of those RAM or bus bandwidth limit problems.
Posted on 2004-03-25 00:37:49 by f0dder
Yeah, if it improves memory transfers that much, why not with the vram??? Well, I thought of this: it must be a latency problem between sysmem and vram...

Why? Because I tried using one of my FX functions which, for example, fades a source surface to gray by a percentage and writes the result to a destination surface. I used it to blit to the primary surface, which is of course in vram, and it gave me the EXACT same speed as when using our super-optimized MOVNTQ blit... So, conclusion anyone? There is a certain latency when writing to vram, something that acts like the latency when doing a DIV. There is only so much data you can buffer, and then the processor stalls until it can write again. Or at least, this is how I see it...

Anyway, that made me think a lot. The idea of doing deferred rendering would make a lot more sense then, because for every pixel that needs more calculations, the previous vram write would have time to finish, which would almost nullify this latency, just like when optimizing by using parallel instructions instead of ones that depend on each other. Am I right?
Posted on 2004-03-25 18:05:32 by persil
Throughput (bandwidth), not latency :)
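
To put a number on it, here is a small sketch that turns a frame size and an update time into effective throughput. Only the 8ms figure comes from this thread; the 640x480, 32-bit frame is an assumed example, and the function names are made up.

```c
#include <assert.h>

/* Size of one frame in bytes. */
static double frame_bytes(int width, int height, int bytes_per_pixel)
{
    return (double)width * (double)height * (double)bytes_per_pixel;
}

/* Effective throughput in MB/s (1 MB = 1,000,000 bytes here)
 * for a transfer of `bytes` that takes `ms` milliseconds. */
static double throughput_mb_per_s(double bytes, double ms)
{
    assert(ms > 0.0);
    return bytes / (ms / 1000.0) / 1e6;
}

/* Assumed example: a 640x480 frame at 4 bytes per pixel is 1,228,800 bytes.
 * Pushing it in 8 ms works out to about 153.6 MB/s:
 *
 *     throughput_mb_per_s(frame_bytes(640, 480, 4), 8.0)
 */
```

If the path to vram tops out around a figure like that, the update time is just bytes divided by throughput, whichever instructions issue the writes. That would fit the fade routine and the MOVNTQ blit coming out at the same speed.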
Posted on 2004-03-26 08:32:20 by f0dder
okay, bandwidth :)

But... what is that SEH stuff anyway? :confused:
Posted on 2004-03-26 08:49:57 by persil
Structured Exception Handling, on Windows. It basically allows you to handle errors without if blocks (conditional jumps) - handle things like access violations, general protection faults, and invalid opcodes, rather than crashing.

You'd set up a SEH frame (google this board for SEH, you should find stuff - if not, go to http://www.jorgon.freeserve.co.uk/ and look for "exception", and google as well) and then call your SSE2 blitter. If it causes an exception, check why - if it's because of unaligned data, call a MOVQ+MOVNTQ blitter or something instead.
Posted on 2004-03-26 09:15:30 by f0dder
Hi again!

I've continued to work on other aspects of the program and the strange thing I was explaining earlier occurred again, but this time I can reproduce it in a WEIRD way.

When running my test, it runs at 130fps, but when Winamp 5's notification appears (if you've used it you know what I mean), the frame rate goes up to ~200fps, and I know that the game is doing the exact same thing. The other thing I noticed, which makes sense, is that when that happens, the CPU usage goes to 100%, whereas the rest of the time it somehow doesn't exceed 50%...

What the hell is going on?
Posted on 2004-03-29 18:53:35 by persil
Remember me?

I've tried using GDI and it's actually 2X faster than using DDraw's Blt or my own software blitter. So I guess that GDI, at least, is using some form of hardware acceleration for blits.

Damn, I'm really starting to hate DirectX!!! If this goes on I'm gonna use simple Windows APIs to program my game and forget about DirectDraw, DirectInput and DirectSound!!! Well, maybe not, but I sure won't be using DDraw's Blt... I'll write my wrapper to be as GDI-compatible as possible.

Ha, can you believe it. GDI being faster. In what world am I living???:confused:
Posted on 2004-04-11 21:01:52 by persil
Try getting away with that in 3D - then realize how craptastic DX is and use OGL instead lol - despite all my rage, I am still just a rat in a cage :tongue:
Posted on 2004-04-12 10:08:49 by Homer