Hello all,
I'm starting a project where I have to optimize code for speed, so I need a good testbed for timing blocks of code. Changing a few instructions in an inner loop makes little to no measurable difference when timing with QueryPerformanceCounter (timing the code inside the loop, not the whole loop from outside), so I need a way to measure the actual CPU cycles I gain through my optimizations.
Is there a good example of how to do that, or any links? Is it even possible? :)
Thanks a lot in advance.
Posted on 2003-05-25 12:30:34 by BugByter
When testing code for speed I usually use this instruction sequence:
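I only use the lower 32 bits of the TSC because my CPU is a 900 MHz AMD Athlon and 2^32 is ~4.3 billion cycles. That is about 4.8 seconds on my CPU, which is quite a long delay. I assume that EAX is treated as an unsigned integer, so the subtraction still gives the right result if the low 32 bits wrap around once during the measurement.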



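rdtsc                  ; EDX:EAX = time-stamp counter (EDX is overwritten too)
push eax               ; save the low 32 bits of the start count
; ... code to be tested (must leave the stack balanced) ...
rdtsc
pop ebx                ; EBX = start count
sub eax,ebx            ; EAX = end - start, modulo 2^32

; Now EAX contains the number of clock cycles it took to execute the tested code
; (including the overhead of the RDTSC and PUSH instructions).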



I believe someone here might have a better way, but this seems to work fine for my purposes.
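To illustrate the arithmetic behind using only the low 32 bits, here is a minimal C++ sketch (not from the thread; `elapsed32` and `seconds_until_wrap` are made-up helper names). It shows why an unsigned subtraction still gives the right delta when the counter wraps once, and roughly how long a given clock rate takes to wrap 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed ticks between two readings of the low 32 bits of a counter.
// Unsigned subtraction is done modulo 2^32, so a single wrap between
// `start` and `end` does not corrupt the result.
inline std::uint32_t elapsed32(std::uint32_t start, std::uint32_t end) {
    return static_cast<std::uint32_t>(end - start);
}

// Seconds until a 32-bit cycle counter wraps at the given clock rate.
// Hypothetical helper, only here to check the "about 4 seconds" estimate.
inline double seconds_until_wrap(double clock_hz) {
    assert(clock_hz > 0.0);
    return 4294967296.0 / clock_hz;  // 2^32 cycles divided by cycles per second
}
```

For example, `elapsed32(0xFFFFFFF0u, 0x00000010u)` is 0x20 even though the counter wrapped in between, and `seconds_until_wrap(900e6)` comes out at about 4.77 seconds. If the tested code runs longer than that, the low 32 bits alone are no longer enough and EDX has to be used as well.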
Posted on 2003-05-25 14:25:16 by x86asm
Counting cycles is very tricky, as I found.


; first, try to switch to another thread,
; so that next time we have a longer timeslice of cpu
invoke Sleep,0

; now save register ESI, otherwise your program
; might crash if you use a window loop later.
push esi

mov ecx,500 ; give it 500 tries to improve itself (cache and stuff)
mov esi,-1
_do_it_again:
push ecx
RDTSC
push eax
;===[[ Your code here >>===\



;=====================/
RDTSC
pop edx
sub eax,edx
.if eax<esi ; unsigned compare
mov esi,eax ; in ESI, we have the minimum cycles count that the code ever took.
.endif
pop ecx
dec ecx
jnz _do_it_again
mov eax,esi
pop esi ; restore important register

sub eax,9 ; exclude additional time that the two 'RDTSC' and the 'push eax' took

PrintText "Count of cycles the code takes:"
PrintDec eax



This works fine, but the "sub eax,9" might need fine-tuning for your CPU. If you use the FPU inside your tested code, you will get awful results, like 3000 cycles for a single "fadd". If you are going to test FPU code, run this before all of the code above (before the "invoke Sleep,0"):


.data
f_temp1003 real4 1.03
.code
xor ecx,ecx
.while ecx<100
; start using random fxxx commands, and some memory, including local variables !
; here, I use no local vars , but you should
fld f_temp1003
fld ST
fmul
fld1
fadd
fstp f_temp1003
inc ecx
.endw


I am on a K6-2 450 MHz with 64 MB of RAM @ 66 MHz.
On my PC, accessing a byte that isn't in the cache takes about 6.5 cycles. If you access memory randomly, the cycle count is greater, but if you walk through an array of bytes, this is the approximate 'penalty' on my PC.
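For anyone who wants the same idea outside of MASM, the minimum-of-N loop above can be sketched in C++ roughly like this. It is only a sketch under assumptions: `min_ticks` and `net_min_ticks` are made-up names, and `read_ticks` stands for any callable that returns the low 32 bits of a cycle counter. Instead of the hard-coded "sub eax,9", this version measures an empty body with the same harness and subtracts that, which is a swapped-in technique and not what the assembly does.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Run `code` `runs` times and return the smallest tick delta seen.
// `read_ticks` is any callable returning the low 32 bits of a counter.
template <class ReadTicks, class Code>
std::uint32_t min_ticks(ReadTicks read_ticks, Code code, int runs = 500) {
    assert(runs > 0);
    std::uint32_t best = std::numeric_limits<std::uint32_t>::max();  // like "mov esi,-1"
    for (int i = 0; i < runs; ++i) {
        const std::uint32_t start = static_cast<std::uint32_t>(read_ticks());
        code();
        const std::uint32_t end = static_cast<std::uint32_t>(read_ticks());
        const std::uint32_t delta = static_cast<std::uint32_t>(end - start);  // modulo 2^32
        if (delta < best) best = delta;  // unsigned compare, keep the minimum
    }
    return best;
}

// Minimum ticks for `code` minus the minimum ticks for an empty body,
// so the cost of the two counter reads is measured rather than guessed.
template <class ReadTicks, class Code>
std::uint32_t net_min_ticks(ReadTicks read_ticks, Code code, int runs = 500) {
    const std::uint32_t overhead = min_ticks(read_ticks, [] {}, runs);
    const std::uint32_t gross = min_ticks(read_ticks, code, runs);
    return gross > overhead ? gross - overhead : 0;
}
```

Called as `net_min_ticks(reader, [] { /* code under test */ })`, it returns the best case over 500 runs. Taking the minimum rather than the average is what filters out runs that were hit by a cache miss or a task switch, same as in the assembly version.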
Posted on 2003-05-25 14:58:53 by Ultrano

Cool, I'm going to use this from now on ;)
Posted on 2003-05-25 15:25:56 by x86asm
Hello, thanks a lot for your ideas!

Actually, I'm going to use inline asm in a C++ DLL. How would I have to change the code to use it there?
I'm going to take MSVC's asm output for a very frequently called function and put all of it into an __asm statement. Then I will try to use MMX/3DNow! optimizations to get a real boost (using the C source, of course, to identify spots to optimize more easily, e.g. matrix transformations).
By the way, is there anything special I have to look out for when using __asm in MSVC? Does it do something bad, like set up a frame?

Thanks a lot for your help!
Posted on 2003-05-25 15:55:30 by BugByter

I haven't encountered any problems using the __asm keyword under MSVC. The instruction syntax is the same; just define all the REAL4s that Ultrano declared as floats in your C code.
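As a rough idea of what the counter read could look like from C++, here is a hedged sketch. The function name `read_tsc_low32` is made up, and only the first branch uses the `__asm` block discussed here; the other branches are assumptions added so the snippet also builds with 64-bit MSVC (which has no inline assembler, so the `__rdtsc` intrinsic is used), with GCC/Clang on x86, and with a plain clock fallback elsewhere. As far as I know, MASM high-level directives such as `.if`, `.while` and `invoke` are not accepted inside `__asm`, so those parts of the loop would have to be written in C++ or with plain compares and jumps.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && defined(_M_IX86)
// 32-bit MSVC: inline assembler, same instruction syntax as the MASM code.
inline std::uint32_t read_tsc_low32() {
    std::uint32_t lo;
    __asm {
        rdtsc           ; EDX:EAX = time-stamp counter
        mov lo, eax     ; keep only the low 32 bits
    }
    return lo;
}
#elif defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
// 64-bit MSVC: use the compiler intrinsic instead of __asm.
inline std::uint32_t read_tsc_low32() {
    return static_cast<std::uint32_t>(__rdtsc());
}
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
// GCC/Clang on x86: same intrinsic, different header.
inline std::uint32_t read_tsc_low32() {
    return static_cast<std::uint32_t>(__rdtsc());
}
#else
#include <chrono>
// Not an x86 target: fall back to nanoseconds from a monotonic clock.
// These are NOT CPU cycles; this branch only keeps the sketch compilable.
inline std::uint32_t read_tsc_low32() {
    const auto since_epoch = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<std::uint32_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(since_epoch).count());
}
#endif
```

Usage is the same as the two RDTSC reads in the assembly: call `read_tsc_low32()` before and after the code under test and subtract the two values as unsigned 32-bit integers.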
Posted on 2003-05-25 15:57:52 by x86asm