Precise because we are trying to measure the number of cycles executed within an interval of code. Small code because we are under windows and are not in control.

Attached is a RadASM project for timing code fragments, and I'd be interested if others could try it out to let me know if I've made some errors. The code should return a measurement of zero on all systems.

This is nothing great or new - just an example of how to time code. Agner Fog, Michael Abrash, and others have created libraries to time code. I have consulted the sources at Intel and AMD as well.

It takes 21000 cycles to load a code cache line from memory on my PC (ouch!). This is why the timing loop is executed once before the actual timing. Hence, you could not do this and time your memory. I was able to time single instructions with a great deal of repeatablity.

(Use version below.)
Posted on 2002-03-02 23:24:12 by bitRAKE
This is the results on my PC(800mhz 256mb ram):

First Try:

OverHead: 243 cycles/byte
MyPROC: 1 bytes, 1000 loops, -91000 cycles

Second Try:

OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 3 cycles

Third Try:

OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 1 cycles

:) Nice Tool :)
Posted on 2002-03-02 23:47:54 by stryker
The tool is vkim's, yes it is very nice!

Don't close the debug window and run the program several times? That is strange that the timing is off by a couple cycles in 1000 loops? Should be a multiple of 1000 if there is a timing error in the loop calculation. Is any value occur more often than others?

Here is what I get (Athlon TB):
OverHead: 77 cycles/byte

MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
This means the timing overhead is 77 cycles and is being subtracted everytime a timing is made. Also after 1000 itterations, there was no error - nothing is being timed.
Posted on 2002-03-02 23:50:38 by bitRAKE
Actually, I close it everytime it finished its job :). Sometimes I get 0 cycles. Yes, 1 2 and 3 are the usual values that pops up.

Here's another one:

OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 3 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 4 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 2 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 1 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 1 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 3 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 3 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 4 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 3 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 1 cycles
OverHead: 271 cycles/byte
MyPROC: 1 bytes, 1000 loops, -118997 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 152 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles

I must have a buggy PC :). I downloaded SiSandra Soft a few months back and it turns out, I always have different speed at different times(especially RAM) :)
Posted on 2002-03-02 23:53:28 by stryker
No, means I have to fix the code for Pentiums. :) What OS are you running. I know how to fix the OverHead calculation being off, but I don't know about the 1-4 cycles over? That is strange.
Posted on 2002-03-02 23:59:29 by bitRAKE
Intel P3 800 mhz
256mb RAM
Windows 2000 SP2 (*Not needed: all patches too :) *)
Posted on 2002-03-03 00:12:57 by stryker
bitRAKE: I get exactly the same result as you, on an athlon TB as well.

Thomas
Posted on 2002-03-03 03:35:01 by Thomas
*ahem*

Here we go:

OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, 11384 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles
OverHead: 80 cycles/byte
MyPROC: 1 bytes, 1000 loops, -3000 cycles

This looks... weerd.

I have a AMD Athlon Thunderbird at 1333 mHz

PS: Of course I forgot to mention this will be a great tool when finished :alright:
Posted on 2002-03-03 03:40:48 by Qweerdy
- Athlon 800mhz, 1536MB RAM


OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 41 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
OverHead: 77 cycles/byte
MyPROC: 1 bytes, 1000 loops, 0 cycles
Posted on 2002-03-03 04:25:31 by bazik
Hi bitRAKE,
Today I don't do C/C++ anymore.. so please do not expect too much support from this code. But it worked perfectly CPU after CPU.. originally I wrote it on my first Pentium.

The fact that it tests several times the same function is not casual.

Use it this way:



< set up any register as you wish >
< set up stack as you wish >
MOV [PROFILEROUTINE],OFFSET YourRoutineToBeTested
CALL PROFILE
< now your registers and stack will be just like if you called YourRoutineToBeTested directly, >
< instead of going through PROFILE. Also, also, the number of CPU cycles it took to execute >
< YourRoutineToBeTested is stored in the unsigned 64 bit [PROFILECYCLES] >


PROFILE will then call YourRoutineToBeTested, and return in PROFILECYCLES how many CPU cycles it took.
This is the main quality of this code: it is very transparent. Accross each call to the routine to be tested (there will be several calls, to "stabilize" caches and all), your registers and stack will be preserved, and when PROFILE returns, the registers and stack will be in the same state as they would have been if you called your routine directly. PROFILE is very transparent.
I like tools which are easy to use and powerful at the same time.

NOTE: add or remove the WBINVD instruction to test cached or uncached situations (you can't use WBINVD in Windows, but in Dos you can).

Disclaimer: THIS IS VERY OLD CODE. It was originally written for WatcomC++ 10.x, using #pragma, then rewritten to work with WatcomC++ 11.x, using inline asm and 64bit results instead of 32bit. That should have made it work also on VisualC++, but I never tested it. It was abandoned in my hard disk.. I hope though it's useful to you and/or/xor anybody else.
It wasn't designed for performance *but* for accuracy, reliability and consistency. It will self-adapt to any future CPU.. it did it flawlessly as far as I know until now anyway.

Greets,
Maverick




unsigned __int64 PROFILECYCLES;
unsigned __int64 PROFILEEMPTY;
unsigned long PROFILEROUTINE;
unsigned long PROFILERETURN;
unsigned long PROFILEINEAX;
unsigned long PROFILEINEBX;
unsigned long PROFILEINECX;
unsigned long PROFILEINEDX;
unsigned long PROFILEINEFL;
unsigned long PROFILEOUTEAX;
unsigned long PROFILEOUTEBX;
unsigned long PROFILEOUTECX;
unsigned long PROFILEOUTEDX;
unsigned long PROFILEOUTEFL;
unsigned long PROFILERETADDR;

__declspec(naked) void PROFILE(char* Address) {
asm {
mov [PROFILEINEAX],eax ; saves INPUT EAX (will be trashed by CPUID)
mov [PROFILEINEBX],ebx ; saves INPUT EBX (will be trashed by CPUID)
mov [PROFILEINECX],ecx ; saves INPUT ECX (will be trashed by CPUID)
mov [PROFILEINEDX],edx ; saves INPUT EDX (will be trashed by CPUID)
pushfd
pop [PROFILEINEFL] ; saves INPUT CPU EFLAGS
pop [PROFILERETURN] ; saves return address
call @geteip1
@geteip1:
jmp @skip1
ret
@skip1: pop eax
add eax,2
push [PROFILEROUTINE] ; saves requested PROFILEROUTINE
mov [PROFILEROUTINE],eax ; first we'll profile a simple RET
;
call @geteip2
@geteip2:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp2: jmp @jmp3 ; make sure it gets cached
;
call @geteip3
@geteip3:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp3: jmp @jmp4 ; profile for real (well, let it set up)
;
call @geteip4
@geteip4:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp4: jmp @jmp5 ; profile for real (well, let it set up again)
;
call @geteip5
@geteip5:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp5: jmp @jmp6 ; profile for real (well, let it set up one final time)
;
call @geteip6
@geteip6:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp6: jmp @jmp7 ; profile for real
;
mov eax,[PROFILECYCLES+0]
mov edx,[PROFILECYCLES+4]
mov [PROFILEEMPTY+0],eax ; saves RET cycles count
mov [PROFILEEMPTY+4],edx
pop [PROFILEROUTINE] ; restores requested PROFILEROUTINE
;
call @geteip7
@geteip7:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp7: jmp @jmp8 ; make sure it gets cached
;
call @geteip8
@geteip8:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp8: jmp @jmp9 ; profile for real
;
call @geteip9
@geteip9:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp9: jmp @jmp10 ; make sure it gets cached
;
call @geteip10
@geteip10:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp10: jmp @jmp11 ; profile for real
;
call @geteip11
@geteip11:
pop eax
add eax,11
mov [PROFILERETADDR],eax
@jmp11: jmp @profile ; make sure it gets cached
;
mov eax,[PROFILEEMPTY+0] ; subtracts simple RET overhead
mov edx,[PROFILEEMPTY+4]
sub [PROFILECYCLES+0],eax ; saves cycles, low 32bit
sbb [PROFILECYCLES+4],edx ; saves cycles, high 32bit
mov eax,[PROFILEOUTEAX] ; gives OUTPUT EAX
mov ebx,[PROFILEOUTEBX] ; gives OUTPUT EBX
mov ecx,[PROFILEOUTECX] ; gives OUTPUT ECX
mov edx,[PROFILEOUTEDX] ; gives OUTPUT EDX
push [PROFILEOUTEFL]
popfd ; gives CPU EFLAGS
jmp [PROFILERETURN] ; returns to caller
@profile:
;wbinvd
mov eax,[PROFILECYCLES+0] ; touches caches
mov edx,[PROFILECYCLES+4]
mov eax,[PROFILEINEAX]
mov ebx,[PROFILEINEBX]
mov ecx,[PROFILEINECX]
mov edx,[PROFILEINEDX]
mov eax,[PROFILEINEFL]
mov eax,[PROFILEOUTEAX]
mov ebx,[PROFILEOUTEBX]
mov ecx,[PROFILEOUTECX]
mov edx,[PROFILEOUTEDX]
mov eax,[PROFILEOUTEFL]
mov eax,[PROFILEROUTINE]
mov ecx,32
@stack: push eax
loop @stack
add esp,128
xor eax,eax
cpuid ; flush pipelines
rdtsc
mov [PROFILECYCLES+0],eax ; saves TSC, low 32bit
mov [PROFILECYCLES+4],edx ; saves TSC, high 32bit
xor eax,eax
cpuid ; flush pipelines
mov eax,[PROFILEINEAX] ; restores INPUT EAX
mov ebx,[PROFILEINEBX] ; restores INPUT EBX
mov ecx,[PROFILEINECX] ; restores INPUT ECX
mov edx,[PROFILEINEDX] ; restores INPUT EDX
push [PROFILEINEFL]
popfd ; restores CPU EFLAGS
call [PROFILEROUTINE] ; calls routine to be tested
mov [PROFILEOUTEAX],eax ; saves OUTPUT EAX
mov [PROFILEOUTEBX],ebx ; saves OUTPUT EBX
mov [PROFILEOUTECX],ecx ; saves OUTPUT ECX
mov [PROFILEOUTEDX],edx ; saves OUTPUT EDX
pushfd
pop [PROFILEOUTEFL] ; saves OUTPUT CPU EFLAGS
xor eax,eax
cpuid ; flush pipelines
rdtsc
xchg [PROFILECYCLES+0],eax
xchg [PROFILECYCLES+4],edx
sub [PROFILECYCLES+0],eax ; saves TSC, low 32bit
sbb [PROFILECYCLES+4],edx ; saves TSC, high 32bit
jmp [PROFILERETADDR]
}
}
Posted on 2002-03-03 05:54:48 by Maverick
I looked at my code.

Doesn't make a difference.. but:

__declspec(naked) void PROFILE(char* Address) {

should be changed to:

__declspec(naked) void PROFILE() {

I said it was C/C++ but it's really pure asm.. and I've always used it from inline asm, so..

Also, I said that today I don't do C/C++ anymore. It's not correct, since I'm under contract and working with C/C++. I just meant that when I have a choice I stay away from these languages, nowadays. :)
Posted on 2002-03-03 06:01:32 by Maverick
Wow.. how much time passed. I looked with some interest at my code, and I noticed all those funny jmps.. it's because WatcomC++'s inline asm won't let you do 32bit relative jmps, but only 8 bit.
Some jmps in the code would have needed 32 bit, so I didn't want to bother to hardcode the instructions using db (jmp @profile).
Other jmps would have been 8bit relative offset direct to @profile, but then I would have had to calculate the EIP+x necessary to return to the caller (without a CALL/RET, but with a jmp/jmp), and I was too lazy to do it for each of the 10 calls to the real profile routine. :D
So I didn't bother and just used one scheme for all "calls". No real performance difference, anyway. The code is hard to read, but works perfectly.
Posted on 2002-03-03 06:13:09 by Maverick
Hi bitRAKE, on my PIII 700mhz with Win2k it always settles on 6 cycles after a couple if runs.

If anyones interested I always used Agner Fogs suggestion for timing code;

Initilizations Macro
&nbsp &nbsp fldpi
EndM

CodeToTime Macro
&nbsp &nbsp fmul st,st
EndM

Finalizeations Macro
&nbsp &nbsp fstp st
EndM

.data?
tVal dd 32 dup (?)

.code
mov edx,31
tLp:mov i,edx
&nbsp &nbsp rdtsc
&nbsp &nbsp cpuid
&nbsp &nbsp and v,0

Initilizations

&nbsp &nbsp rdtsc
&nbsp &nbsp mov t,eax
&nbsp &nbsp cld
&nbsp &nbsp REPT 8
&nbsp &nbsp nop
&nbsp &nbsp EnDM

CodeToTime

&nbsp &nbsp clc
&nbsp &nbsp rdtsc
&nbsp &nbsp sub eax,t
&nbsp &nbsp mov edx,i
&nbsp &nbsp sub eax,42
&nbsp &nbsp mov tVal,eax

Finalizeations

dec edx
jns tLp

The sub eax,42 line removes the overhead on my PC and therefore nothing in the CodeToTime macro will return 0.

You simply need a seperate display procedure to output the 32 values in the tVal array.
Posted on 2002-03-03 07:37:45 by Eóin
Don't want to "sell" my code, but honestly I think that it's much more comfortable than having to use separate "initialization" and such routines, and to have to subtract some cycles depending on what machine it's running on. Also, there's the issue of stability of results accross repeated tests.

Greets,
Maverick
Posted on 2002-03-03 08:02:21 by Maverick
Thanks for the replies. I did want to have something that determined the adjustment time without having to fudge the figures to make them work. It needs some work. :)

Qweerdy, your results are very odd. If the OverHead was 77 then it would produce the correct timing of zero, but for some reason it's consistently producing and OverHead calculation of 80. ???I don't know???

So far it only works for some Athlon TB's.
Posted on 2002-03-03 10:26:35 by bitRAKE
My routine automatically calculates the overhead, if it is of any help.

Greets,
Maverick
Posted on 2002-03-03 10:29:10 by Maverick
Thanks for the code Maverick and E?in - I was hoping to get some feedback on the methods others were using. I like vkim's debug window, so I thought it'd be cool to write a timing layer in macros.

Here is another version that insures the prep code is in the instruction cache - this doesn't seem to be required of Athlon TB's. :) Also, I clear EAX before CPUID.
Posted on 2002-03-03 12:31:20 by bitRAKE
BiTRAKE,

I would be interested to see if the variations that occur in ring3 are handled when performing timings on small sections of code on a one pass basis.

I worked with a number of techniques over time, RDTSC, API timing functions etc .... but they all suffered the same percentage variation due to ring3 suffering priority problems from the OS.

I ended up using large samples to get the duration up so that this variation was reduced to 1% or lower.

I will be interested to see what the results are with this method.

Regards,

hutch@movsd.com
Posted on 2002-03-03 15:08:10 by hutch--
Eagle's Eye View:

OFFTOPIC - Aha!!! so this is the reason why your asking on how to clear the cache. :)
Posted on 2002-03-03 19:29:45 by stryker
On my old p166 mmx I consistently get this:

OverHead: 57 cycles/byte
MyPROC: 1 bytes, 100 loops, 100 cycles

I wasn't able to get a single different result! Nice!

I'll test it again on my Athlon TB, but I think you said you solved that problem.
Posted on 2002-03-04 10:52:37 by Qweerdy