Hello all,

I've programmed Intel asm for a while, but I'm new to ARM. I found lots of nice tutorials on the net, but I just can't figure out the idea behind some code I found in a program called vgagapi. It's for PocketPC, and the code is in a pixel copy loop.

...
ldr  r0,
ldr  r1,
ldr  r2,
ldr  r3,
ldr  r0,
ldr  r1,
ldr  r2,
ldr  r3,
ldr  r0,
ldr  r1,
ldr  r2,
ldr  r3,
...


I'm sure it has something to do with caching, but what is it?
Posted on 2006-12-11 06:59:33 by dzolee
Just a guess: it is touching every 16th byte of memory to prefetch a cache line.
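To make the guess concrete, here is a minimal C sketch of "touching" memory at a fixed stride. The helper name touch_block and the 16-byte TOUCH_STRIDE are assumptions for illustration only; the real cache line size depends on the core, and the original ldr operands are not visible in the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed spacing between touches; it mirrors the "every 16th byte"
   guess and is not taken from the original code. */
#define TOUCH_STRIDE 16u

/* Hypothetical helper: reads one byte every TOUCH_STRIDE bytes so that
   the cache has a reason to fetch the surrounding line.
   Returns the number of loads issued. */
size_t touch_block(const void *src, size_t len)
{
    const volatile uint8_t *p = (const volatile uint8_t *)src;
    size_t touches = 0;

    assert(src != NULL);

    for (size_t off = 0; off < len; off += TOUCH_STRIDE) {
        (void)p[off];   /* volatile read: the compiler must emit the load */
        touches++;
    }
    return touches;
}
```

With these assumptions, touching a 256-byte block issues 16 loads, one per 16-byte step.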
Posted on 2006-12-11 12:28:14 by Dr. Manhattan
True, and the speedup against the simple

; note: custom macro-preprocessor here
testloop:
push r14
push r0-r12
mov r14,4608
add r14,r14,46
@@:
subs r14,r14,1
ldmia r1!,{r2-r12} ; 44 bytes
stmia r0!,{r2-r12}
bne @B
pop r0-r12
pop r15

is 445ms (with the touch sequence) against 560ms (the simple loop),
measured over 100 copies of a 320x320 16bpp screen from backbuffer to frontbuffer.
That makes about 44MB/s of bandwidth against 35MB/s.
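The bandwidth figures can be checked with a little arithmetic. The sketch below assumes "MB" means 2^20 bytes, which is the reading under which the quoted numbers come out; the function names are made up for illustration.

```c
#include <assert.h>

/* Bytes moved by `copies` full-screen copies of a width x height
   framebuffer with bytes_per_pixel bytes per pixel. */
unsigned long total_bytes(unsigned long width, unsigned long height,
                          unsigned long bytes_per_pixel, unsigned long copies)
{
    return width * height * bytes_per_pixel * copies;
}

/* Throughput in MiB/s (1 MiB = 1048576 bytes) for `bytes` moved in `ms`
   milliseconds. */
double mib_per_second(unsigned long bytes, double ms)
{
    assert(ms > 0.0);
    return ((double)bytes / 1048576.0) / (ms / 1000.0);
}
```

For 100 copies of a 320x320 16bpp screen this gives 20,480,000 bytes, or roughly 43.9 MiB/s at 445ms and 34.9 MiB/s at 560ms.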

P.S.: I fused it into my memcpy() proc, at 320 bytes/iteration, and got 420ms. Without caching, the proc was taking 720ms >_< . Hmm, now there's room for much more in my games' game-loop with those extra 160ms/s :D .
Posted on 2006-12-11 15:29:20 by Ultrano
thanks a lot.
Posted on 2006-12-12 03:54:00 by dzolee
Hi guys

I am interested in this topic, but I think I didn't quite get it. Where do you put the sequence that touches the memory: before the copy loop? And if it's a 'move' memory routine, should I touch the source buffer, the destination buffer, or both?
Like in Ultrano's example loop, where do you put the 16 'ldr r0,' instructions?


Eugen
Posted on 2006-12-15 03:23:34 by Eugen
The sequence touches 256 bytes, so put it inside the loop, right before the series of ldmia/stmia pairs. In each iteration you first touch 256 bytes of the source, then copy those 256 bytes.

And it seems with more than 256 bytes at a time, you can get better results.

You don't need to touch the destination if it is 4-byte aligned; you'd just waste cycles (the "write-through" RAM access scheme doesn't need the previous contents of the destination). I guess touching the destination would improve performance when it is not 4-byte aligned, but I haven't tested that yet.
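The loop structure described here can be sketched in portable C. This is only an illustration under stated assumptions: copy_with_touch, BLOCK_BYTES and TOUCH_STRIDE are hypothetical names, memcpy stands in for the ldmia/stmia pairs, and the 16-byte stride is a guess rather than something taken from the original code. Only the source is touched, as suggested for an aligned destination.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES  256u  /* bytes touched, then copied, per iteration */
#define TOUCH_STRIDE 16u   /* assumed spacing between touches */

/* Hypothetical routine: copies len bytes in BLOCK_BYTES chunks, touching
   each source chunk right before copying it. The destination is never
   touched. Returns the number of full blocks processed. */
size_t copy_with_touch(void *dst, const void *src, size_t len)
{
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;
    size_t blocks = 0;

    assert(dst != NULL && src != NULL);

    while (len >= BLOCK_BYTES) {
        /* Step 1: touch the next source block at a fixed stride. */
        const volatile uint8_t *t = s;
        for (size_t off = 0; off < BLOCK_BYTES; off += TOUCH_STRIDE)
            (void)t[off];

        /* Step 2: copy the same block (stand-in for ldmia/stmia pairs). */
        memcpy(d, s, BLOCK_BYTES);

        d += BLOCK_BYTES;
        s += BLOCK_BYTES;
        len -= BLOCK_BYTES;
        blocks++;
    }

    /* Tail shorter than one block: plain copy, no touching. */
    if (len > 0)
        memcpy(d, s, len);

    return blocks;
}
```

Whether this is any faster than a plain copy depends entirely on the target: it can only help where the data cache is enabled and the buffers are cacheable.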
Posted on 2006-12-15 03:39:52 by Ultrano

Ultrano, thanks for your explanations. I did what you suggested, but I don't get any speed improvement with the sequence marked with '*'; instead I get a slowdown of about 10%. Do you have any idea what I'm doing wrong?



SpeedUp_Test1
stmfd SP!,{r0-r12,lr}

ldr r0,=buff1
ldr r1,=buff2
mov r14,#4096

loop_copy

sub r14,r14,#256

;************************
ldr  r4,
ldr  r5,
ldr  r2,
ldr  r3,
ldr  r4,
ldr  r5,
ldr  r2,
ldr  r3,
ldr  r4,
ldr  r5,
ldr  r2,
ldr  r3,
;************************

ldmia r0!,{r2-r12} ;11 regs = 44 bytes
stmia r1!,{r2-r12}

ldmia r0!,{r2-r12} ;44*2=88
stmia r1!,{r2-r12}

ldmia r0!,{r2-r12} ;44*3=132
stmia r1!,{r2-r12}

ldmia r0!,{r2-r12} ;44*4=176
stmia r1!,{r2-r12}

ldmia r0!,{r2-r12} ;44*5=220
stmia r1!,{r2-r12}

ldmia r0!,{r2-r10} ;220+36=256
stmia r1!,{r2-r10}

cmp r14,#256
bgt loop_copy


ldmfd SP!,{r0-r12,pc}



Eugen
Posted on 2006-12-15 04:16:26 by Eugen
Hmm... what CPU and chipset are you testing on?
I'm on a PXA255 (XScale core, ARMv5TE) at 400MHz, with 100MHz SDRAM.
Btw, note you've skipped
ldr  r2,
and your cmp should be placed several lines above.
Posted on 2006-12-15 04:30:24 by Ultrano

Yep, fixed those, no change.  :)

The CPU is an ARM946E-S, inside a custom chip developed by my company. I noticed in the boot code (which I didn't write) that only the instruction cache is enabled, but the buffers in my test are placed in the code area, right after the code I pasted here.

Posted on 2006-12-15 04:44:10 by Eugen
The CPU won't cache that data no matter where it is, then: with only the instruction cache enabled, data loads are not cached at all. The two caches are independent, and both can hold the same contents at the same time!
For this reason putting data inline in your code is not good... which GCC does heavily ...
Posted on 2006-12-15 04:58:05 by Ultrano