Sometimes it looks useless to unroll a loop... why is that?
Besides that, what other optimizations do you suggest for memory-to-memory (video) data moves?

For example, we have to do this whole-screen move 2 times in HE (once for the terrain data and clear screen, and a second time to send the final backbuffer to video). Each of these memory transfers takes approx. 10 ms on my old P2/400, so it limits our max framerate to approx. 50 fps.

During the game other things reduce it even more, but they are distributed across different routines, while this is just one simple routine...

So we tried to unroll... but got no improvement whatsoever :(

We have to move line by line because sometimes there is a gap between lines (called Pitch), but to keep things simple I removed that from the code samples.
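To illustrate the pitch handling that the samples leave out, here is a minimal sketch in the same assembly style. It assumes two hypothetical variables, video_ptr and video_pitch, which the program would fill in after locking the destination surface. Neither name comes from the original code. It reuses src_vector from the sample and clears the direction flag explicitly.

```asm
.data?
video_ptr   dd ?              ; hypothetical: address of the first destination line
video_pitch dd ?              ; hypothetical: bytes from one destination line to the next

.code
blit_with_pitch PROC
        push esi
        push edi
        cld                           ; make rep movsd count upwards
        mov  esi,offset src_vector    ; source is packed, no gap between lines
        mov  edi,video_ptr
        mov  eax,video_pitch
        sub  eax,800*2                ; eax = gap in bytes after each 1600-byte line
        mov  edx,600                  ; lines
pitch_line_loop:
        mov  ecx,(800*2)/4            ; 400 dwords per line
        rep  movsd
        add  edi,eax                  ; skip the gap at the end of the destination line
        dec  edx
        jnz  pitch_line_loop
        pop  edi
        pop  esi
        ret
blit_with_pitch ENDP
```

When the pitch equals the line length (1600 bytes here) the gap is zero and the routine behaves like the plain line-by-line copy.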

Here is the sample code:



.data?

;---------------------------------------------------
; test on these data buffers
; they represent one 800x600 screen
; at 16 bits per pixel
; (dw = one 16-bit pixel per element)
;---------------------------------------------------
src_vector dw 800*600 dup (?)
dest_vector dw 800*600 dup (?)
.code
align 4
;===========================================
; the non unrolled stuff looks like the fastest?
;===========================================
blit_manually_lines PROC
mov esi,offset src_vector
mov edi,offset dest_vector
mov edx,600 ;lines

bucla_y_blit_sys:
mov ecx,400 ; 800 pixels * 2 bytes / 4 = 400 dwords per line

rep movsd

dec edx
jnz bucla_y_blit_sys
ret
ENDP

align 4
;===========================================
; now the unrolled version, the same speed?
;===========================================

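blit_manually_unrolled1 PROC
; note: clobbers eax, ebx, edx, ebp, esi and edi; save ebx, ebp, esi
; and edi around the call if the caller relies on them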


mov esi,offset src_vector
mov edi,offset dest_vector

mov ecx,600 ;lines

@@loop_y_unroll_1:

push ecx

mov ecx,(800/8) ;columns

@@loop_x_unroll_1:

;--------------------------------
; read in 8 pixels from source
;--------------------------------
mov eax,[esi]
mov ebx,[esi+4]
mov edx,[esi+8]
mov ebp,[esi+12]

;------------------------------------
; write 8 pixels to destination
;------------------------------------
mov [edi],eax
mov [edi+4],ebx
mov [edi+8],edx
mov [edi+12],ebp

;--------------------------------------
; next pixels address
;--------------------------------------
add esi,16
add edi,16

dec ecx
jnz @@loop_x_unroll_1

pop ecx

dec ecx
jnz @@loop_y_unroll_1

ret

ENDP



What do you think?

Anything else we can do to improve those big time-consuming routines?

What is the fastest way to move memory on P2 and better machines (no P1)?

PS: we tried MMX and got the same result... no improvement whatsoever...
Posted on 2002-04-28 07:59:20 by BogdanOntanu
I think you would find that unrolled loops perform better when doing very many iterations, although with the current speed of processors this is debatable. Here's some pseudocode to help visualize:


@:
Do_stuff_here
Decrement the counter
Is the counter at 0?
No-so jump backward
Yes-so proceed forward
;
; ^ Normal tightly rolled loop


@:
Do_stuff_here
Do_stuff_here
Do_stuff_here
Do_stuff_here
Decrement the counter by 4
Is the counter at 0?
No-so jump backward
Yes-so proceed forward
;
; ^ Loop unrolled by a factor of 4.
; Counter decrementing and
; testing are only carried out
; every 4 iterations, thus saving
; a bit of time.


So I bet if you were to take your rolled loop and your unrolled loop, and set the counters to, say, 0FFFFFFFFh, you would eventually see the unrolled loop come out as the winner.
Posted on 2002-04-28 08:31:40 by iblis
Unfortunately, there is a great difference between what is best on each processor. Looking at the results in ( this thread ), it appears that an MMX/FPU loop is the fastest on the P2, but not by much over REP MOVSD.
Posted on 2002-04-28 09:31:06 by bitRAKE
I had a bit of a look at this a while back, and I found (after some research) that the P6 architecture special cases "rep movsd" when the data size to be moved is equal to or greater than the L2 cache.

When ecx is lower, rep movsd is bested by a "mov eax, [src] / mov [dst], eax" pair.

But using an MMX register was consistently as fast as the special-cased "rep movsd", even when smaller amounts of data were transferred.

Mirno
Posted on 2002-04-28 11:26:32 by Mirno
The amount of data we transfer is always the same:
800x600x2 bytes = 960,000 bytes, and it takes 10 ms.

This could be even greater if we switch to higher resolutions (i.e. 1024x768 etc.). I guess no CPU has 2M of cache nowadays anyway...

So even at 800x600 it's greater than my L2 cache (512k), but L2 is not going to help much when we are writing into video memory... though it will help in other cases and with the reads...

So you guys think we should use only "rep movsd"?
Posted on 2002-04-28 11:33:13 by BogdanOntanu
Afternoon, BogdanOntanu.

Another idea would be to break the loops up into L1-sized chunks (32 KB on a P2?). Cache misses would be a major cause of the slow code.

Just have another loop around what you've already got; have the inner loops process ~30 KB maximum at a time.

960,000 bytes / 30,000 bytes (leaving 2768 bytes in L1 for code) == 32 outer loops.
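One way to read the suggestion above is the "touch the block, then copy it" pattern sketched below. This is only a sketch under assumptions: it reuses src_vector and dest_vector from the first post, assumes a 32-byte cache line and a 16 KB L1 data cache, and uses 8 screen lines (12,800 bytes) per block so that 600 lines divide evenly into 75 blocks. The routine name and constants are made up for illustration, and whether it beats a plain rep movsd has to be measured.

```asm
LINE_BYTES  equ 800*2                  ; one 800-pixel line at 16 bpp
BLOCK_LINES equ 8                      ; assumption: 8 lines per block
BLOCK_BYTES equ LINE_BYTES*BLOCK_LINES ; 12,800 bytes, below a 16 KB L1 data cache
CACHE_LINE  equ 32                     ; assumption: 32-byte cache lines

blit_blocked PROC
        push esi
        push edi
        push ebx
        cld
        mov  esi,offset src_vector
        mov  edi,offset dest_vector
        mov  edx,600/BLOCK_LINES       ; 75 blocks
blocked_block_loop:
        ; pass 1: read one dword per cache line to pull the source block into cache
        mov  ebx,esi
        mov  ecx,BLOCK_BYTES/CACHE_LINE
blocked_touch_loop:
        mov  eax,[ebx]
        add  ebx,CACHE_LINE
        dec  ecx
        jnz  blocked_touch_loop
        ; pass 2: copy the block that was just touched
        mov  ecx,BLOCK_BYTES/4
        rep  movsd
        dec  edx
        jnz  blocked_block_loop
        pop  ebx
        pop  edi
        pop  esi
        ret
blit_blocked ENDP
```

Simply wrapping rep movsd in an outer loop without the touch pass moves the same bytes in the same order as one big copy, so the read-ahead pass is the part that could make a difference.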

I'm definitely no optimizer guru and not very knowledgeable about this, so maybe someone else can shed some light on this idea?

Cheers,
Scronty
Posted on 2002-04-28 17:42:41 by Scronty
Correct me if I'm wrong, but the L1 cache on the P2 is 16k data + 16k code.
(Would that mean that setting the data size to 16,384 bytes per loop prevents L1 cache misses, or do you need to step that down somewhat?)
Posted on 2002-04-28 18:11:48 by grv575
Bogdan,

Some time ago I tried everything I could with integer code to improve on REP MOVSD, and none of it was faster. I then tried unrolled loops with MOVQ and it was still slower, but not by much.

Allowing that you have done all the other things like ensuring the source and target memory is aligned properly, I think the only chance you have is to try something like what Scronty has suggested to see if you get some speed increase that way.

I think BitRAKE did a very fast PIII or PIV version, but it would only run on a PIII or later, so it is not very useful for general-purpose stuff.

With normal stuff, I think the memory access speed is becoming a major factor in transfer speed but there may be a trick with cache access that will make it go faster.

Regards,

hutch@movsd.com
Posted on 2002-04-29 07:40:30 by hutch--
Hi Hutch

Yes, those were my test results too: REP MOVSD looks like the fastest and sure is the smallest ;)

Memory alignment, hmm, that was not taken care of... but I was thinking that on P2 and greater CPUs this does not matter anymore... am I wrong?

Besides, in the other thread mentioned here the memory blocks transferred were pretty small (sometimes 64 bytes, LOL)

What we need is to transfer 1-2M or more of data fast and at a sustained rate (more than 30-50 times per second, that is)

Thx all
Posted on 2002-04-30 01:24:12 by BogdanOntanu
Bogdan,

On anything from PII up, memory alignment is important to prevent double grabs at the location if it's not aligned. With the block transfer you have in mind, it should be easy enough to organise and, from memory, it's worth aligning it at 16 bytes if you can do it.

The technical data I have seen is that REP MOVSD on aligned data over 64 bytes in size is very hard to beat and my own testing showed that. Perhaps a dedicated MOVQ algo that was unrolled may get up to pace but I would be tempted to try the cache suggestion that Scronty made as it may solve one problem in transfer speed.
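For comparison, an unrolled MOVQ copy of the kind mentioned above could look like the sketch below. It is an illustration, not the routine anyone in this thread actually tested. It reuses src_vector and dest_vector from the first post, assumes the buffers are at least 8-byte aligned, and assumes the assembler has MMX enabled (for example .586 plus .mmx in MASM). The routine name is made up.

```asm
blit_mmx PROC
        push esi
        push edi
        mov  esi,offset src_vector
        mov  edi,offset dest_vector
        mov  ecx,(800*600*2)/32       ; 960,000 bytes / 32 bytes per pass = 30,000 passes
blit_mmx_loop:
        movq mm0,[esi]                ; load 32 bytes (16 pixels at 16 bpp)
        movq mm1,[esi+8]
        movq mm2,[esi+16]
        movq mm3,[esi+24]
        movq [edi],mm0                ; store the same 32 bytes
        movq [edi+8],mm1
        movq [edi+16],mm2
        movq [edi+24],mm3
        add  esi,32
        add  edi,32
        dec  ecx
        jnz  blit_mmx_loop
        emms                          ; leave MMX state so later FPU code works
        pop  edi
        pop  esi
        ret
blit_mmx ENDP
```

The emms at the end matters if any floating-point code runs afterwards, because the MMX registers alias the FPU register stack.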

Regards,

hutch@movsd.com
Posted on 2002-04-30 01:40:31 by hutch--
Bogdan, to solve any alignment problems, use VirtualAlloc for your
screen buffer, you'll get 4k aligned memory. Test "rep movsd", an
MMX xfer routine, and one of those fancy things using prefetch.
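A minimal sketch of the VirtualAlloc suggestion, in plain push/call form so that it does not depend on a particular macro package. It assumes VirtualAlloc is already declared and imported through the project's usual Win32 include files and import library. The variable backbuffer_ptr and the routine name are hypothetical. The three constants carry their standard Win32 values and are only needed if your include files do not define them already.

```asm
MEM_COMMIT      equ 1000h
MEM_RESERVE     equ 2000h
PAGE_READWRITE  equ 04h

.data?
backbuffer_ptr  dd ?                  ; hypothetical: receives the buffer address

.code
alloc_backbuffer PROC
        push PAGE_READWRITE               ; flProtect
        push MEM_COMMIT or MEM_RESERVE    ; flAllocationType
        push 800*600*2                    ; dwSize: 960,000 bytes
        push 0                            ; lpAddress: let the system choose
        call VirtualAlloc                 ; stdcall, result in eax
        mov  backbuffer_ptr,eax           ; eax is 0 if the allocation failed
        ret
alloc_backbuffer ENDP
```

The caller should check backbuffer_ptr for zero before using it. The returned address is page aligned, which satisfies any of the alignment values discussed in this thread.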

I'm interested in hearing how much time can be saved here, I didn't
really think a systemmem->gfxmem copy was too bad these days?
Posted on 2002-04-30 04:28:27 by f0dder
Nowadays (read: 3 years ago) system->video RAM copy bandwidths of 300 MB/s are very normal. So anything less (on the order of 70-90 MB/s) makes me think that the driver doesn't enable UCWC (uncached write-combining), or just that the processor doesn't support it (old Pentiums).

About alignment: the famous 16 bytes that pop up every time date back to the 486. Although a lot of people still use that value, modern CPUs have 32-, 64- or 128-byte cache lines. E.g. the Pentium has 32, the K7 64, etc.

I suggest first to profile a pure memory fill (i.e. the equivalent of stosd), e.g. clear the screen. So at least you're sure about how much bandwidth you have with the video card.
Then check the UCWC mentioned above. Under DOS you can use a utility called FASTVID to enable UCWC, and some BIOSes have that option directly available.
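For scale, the figures quoted earlier in the thread (960,000 bytes in 10 ms) work out to 960,000 / 0.010 = 96,000,000 bytes per second, roughly 96 MB/s. A pure fill test along the lines suggested above might be sketched like this. It is an assumption-laden illustration: it times the fill with rdtsc (which needs .586 or later in the assembler), returns only the low 32 bits of the elapsed cycle count, and fills dest_vector from the first post. To measure the video card you would point edi at the locked video surface instead. The routine name is made up.

```asm
fill_test PROC
        push edi
        cld
        rdtsc                         ; edx:eax = cycle counter at start
        push eax                      ; keep the low dword of the start time
        mov  edi,offset dest_vector   ; assumption: replace with the video surface address
        xor  eax,eax                  ; fill value 0 (black)
        mov  ecx,(800*600*2)/4        ; 240,000 dwords = 960,000 bytes
        rep  stosd
        rdtsc                         ; edx:eax = cycle counter at end
        pop  ecx                      ; low dword of the start time
        sub  eax,ecx                  ; eax = elapsed cycles (low 32 bits)
        pop  edi
        ret
fill_test ENDP
```

On a 400 MHz CPU, 4,000,000 elapsed cycles correspond to 10 ms, so the returned count can be compared directly with the copy timings reported above.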

One question, Bog: what video card are you testing this on? Because, to give you an example, the old Riva TNT is extremely poor in bandwidth, while the ATI Rage Pro effortlessly reaches the 300 MB/s AGP barrier. Both are quite old and cheap cards, so..

---
Ciao,
Maverick
Posted on 2002-04-30 04:57:07 by Maverick
This might be a bad solution, but could the data be passed compressed and then decompressed? Maybe that saves time?
Posted on 2002-04-30 15:05:02 by Hiroshimator
http://www.sgi.com/developers/technology/irix/resources/asc_cpu.html seemed interesting to me as well, it was written for P3, but I think you can use parts of it for P2 as well.

It covers things like prefetching, getting data to fit cache lines, etc.

Also, there are always the Intel manuals; they have a section on cache optimization.
Posted on 2002-05-02 19:56:58 by _js_
What about using the Direct Memory Access driver? Last I heard it could copy memory to/from memory and I/O.
Posted on 2002-05-21 00:27:24 by eet_1024
DMA on x86 is a joke. It's usable for sound cards and hard drives,
and ... that's about it.
Posted on 2002-05-21 02:16:41 by f0dder
Last I heard DMA ran at 8088 speeds (4.77MHz)... 'course I think EISA and MCA did them at 80286 speeds (10MHz IIRC)... hrm...

Well, presumably the system makers have created faster DMAs since then.
Posted on 2002-05-21 08:57:27 by AmkG