How do I get accurate timings on today's CPUs?

The following macros are inaccurate:




RDTSC_START MACRO
CPUID
RDTSC
push eax
CPUID
RDTSC
pop edx
sub eax,edx
push eax ; Timing overhead value
CPUID
RDTSC
push eax ; Starting count
ENDM

RDTSC_STOP MACRO
CPUID
RDTSC
pop edx
sub eax, edx ; Subtract starting count
pop edx
sub eax, edx ; Subtract timing overhead
ENDM



Tested on a 1.4 GHz processor.
Posted on 2001-11-30 21:52:38 by grv575
It gives me 3284 clocks for:




RDTSC_START

fdiv
fdiv
fdiv

RDTSC_STOP



I don't think that's an accurate reading...

Also try the following C code with Sleep(5000);
It doesn't give me a close value until I use a timer of 80 secs.
Posted on 2001-11-30 22:07:42 by grv575
Remember that Windows is a multitasking environment and all sorts
of shit can go on in the background, even if you boost your thread
priority to the max (which is normally a lame thing to do).
Posted on 2001-11-30 22:39:44 by f0dder
Also, remember that RDTSC returns a 64-bit value in EDX:EAX - those macros ignore that fact. Using the CPUID instruction means you're timing your memory loading into the instruction cache, doesn't it? That is how slow it is. :) I do a dummy iteration to load the algorithm into the cache, record the start time, run several iterations, then record the stop time - works pretty well even with Windows. :) This sets up an ideal environment for timing. I also do more general timing, but this doesn't appear to be what you're after.
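
To illustrate the EDX:EAX point: if only EAX is kept, the count wraps every 2^32 cycles, which is roughly 3 seconds at 1.4 GHz, so multi-second intervals can read wrong. Below is a minimal C sketch of handling the full 64-bit value. The function names are made up for illustration, and the two halves are passed in as plain integers rather than read from the CPU.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two halves RDTSC leaves in EDX (high) and EAX (low). */
uint64_t tsc_combine(uint32_t edx_high, uint32_t eax_low)
{
    return ((uint64_t)edx_high << 32) | eax_low;
}

/* Elapsed cycles between two full 64-bit readings. */
uint64_t tsc_elapsed(uint64_t start, uint64_t stop)
{
    assert(stop >= start);  /* a later reading is expected to be larger */
    return stop - start;
}

/* Seconds until the low 32 bits alone wrap around at a given clock rate. */
double low_half_wrap_seconds(double clock_hz)
{
    return 4294967296.0 / clock_hz;  /* 2^32 cycles */
}
```

For short code sequences the low half is usually enough, but the subtraction is only safe across a wrap if the interval is shorter than the wrap time.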
Posted on 2001-12-01 00:17:24 by bitRAKE
I don't have the official solution for you... but a while ago I noticed the same type of thing..

if you set up your program in a for type loop:



for x = 1 to 10
    get start time
    for y = 1 to 1,000,000
        do something...
    next y
    get stop time
    display time difference
next x


You'd expect 10 clock times with about the same value...

But what I *really* got was something like:

1000
40
38
41
40
38
..
..
..

The point here is that some "setup" threads (probably from CreateWindow or something) are still chugging a good amount of CPU time even though YOUR thread has moved on to your loop and timing code, hence the extra-long time for the "starting" loop.

My solution was simple: ignore the first set or two of data and use the rest for numerical analysis. (How many sets to ignore is a function of how long the loop is ~ you will have to figure this out by experimentation. :) )
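
A rough C sketch of that "ignore the first set or two" idea, assuming the timings have already been collected into an array. The function name and the `skip` parameter are illustrative, not from any library.

```c
#include <assert.h>
#include <stddef.h>

/* Mean of samples[skip..count-1]; the first `skip` warm-up readings
   are ignored. */
double mean_after_warmup(const unsigned long *samples, size_t count, size_t skip)
{
    assert(skip < count);  /* at least one reading must remain */
    unsigned long sum = 0;
    for (size_t i = skip; i < count; i++)
        sum += samples[i];
    return (double)sum / (double)(count - skip);
}
```

With the six readings listed above (1000, 40, 38, 41, 40, 38), skipping none gives a mean of 199.5, while skipping the first gives 39.4.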

But anywho, there is my take on the issue... hope it helps..

NaN
Posted on 2001-12-01 00:53:14 by NaN
If you can run RDTSC from ring0, the accuracy is very good but from normal application level access of ring3, the instruction is subject to a lot of interference from other things running in the operating system which limits its usefulness.

For convenience I prefer to use the API GetTickCount, which barely has millisecond resolution in real time, but if you run a sample that takes longer than a half second or so, the error comes down under 1%, which is plenty good enough.
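
The arithmetic behind that trade-off can be sketched in C as below. This is only an illustration: the real tick granularity of GetTickCount depends on the system and is often coarser than 1 ms, so it is a parameter here, and the function name is invented.

```c
#include <assert.h>

/* Worst-case relative error, in percent, when a timer that ticks every
   `resolution_ms` measures an interval lasting `duration_ms`. */
double timer_error_percent(double resolution_ms, double duration_ms)
{
    assert(duration_ms > 0.0);
    return 100.0 * resolution_ms / duration_ms;
}
```

A 1 ms tick over a 500 ms run comes to about 0.2%. A coarser tick needs a proportionally longer run to stay under 1%.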

Regards,

hutch@movsd.com
Posted on 2001-12-01 01:23:14 by hutch--
The only reason not to use RDTSC is if the processor doesn't support it. Just take into account what you're trying to measure. If you have a long, complex algorithm then it might be impossible to get an accurate "cycle count" under Windows, but if you're testing a short, simpler algorithm that fits into the cache then you should be able to get better accuracy. The measurements get fuzzier the larger the proc, so some statistical calculations are certainly in order.
Posted on 2001-12-01 01:44:01 by bitRAKE
Another problem is that the first time code is executed it takes a lot longer, for reasons explained in Agner Fog's guide. I use the following and it works well. This is based on a sample from Agner's guide, so all credit to him.

;Put initialization code here

rdtsc
mov Tick,eax
clc
nop
nop
nop
nop
nop
nop

; Put code to time Here \/

;/\ /\

clc
rdtsc
sub eax,Tick
sub eax,VARVAL
; eax now contains Timing


You should loop this code about 16 times and take the average of the final 10 or so. Plus, if your app did suffer due to multitasking, it will be very obvious from a huge increase in the timing. For small procedures and the like this is very accurate.

You also need to run this loop once without any code to time. Set VARVAL to whatever timing comes out; this removes the overhead from future timings.
Posted on 2001-12-01 06:45:02 by Eóin
I did a lot of testing with different methods of timing code and the problem with RDTSC in normal ring3 access is that it suffers interference from the operating system which has higher priority with ring0 code.

The variation range when multiple passes were handled to stabilise it was about 3 to 5 percent which reduces the effectiveness in terms of accuracy. If you wish to test code in ring0, you will not get the variation but it is a lot of messing around to do it and it is not an accurate measure of how the code will run in ring3 with other things running as well.

One of the tricks is to run a sequence of CPUID instructions before RDTSC, as CPUID is a serializing instruction that forces earlier instructions to complete, but the percentage wander in output is still there.

This is why I opt for a lower resolution timing technique but run it for half a second or more, so that the percentage error comes down under 1 percent.

Regards,

hutch@movsd.com
Posted on 2001-12-01 14:37:44 by hutch--
Thanks for all the replies. I tweaked the macros a bit and now they get very accurate clock cycle timings even in windows ring3 code. The macros in the attached file aren't bad with today's cpus.

It seems that you should always throw away the first value to take into account switching processor modes, cache prefetches, and whatnot ::)

Eóin: The code you posted from Agner's help file isn't bad, but the clc instruction poses a problem - it isn't pairable with itself, but can pair with other instructions. The cpuid instruction doesn't pair with anything afaik on today's architectures.

Anyway, I tested some FPU operations (fdiv), which take ~3300 clocks on my system (1.4 GHz), followed by simple instructions which should pair:
xor eax,eax (1 clock) ...
xor eax,eax \ xor eax,eax (1 clock) ...
xor eax,eax \ xor eax,eax \ xor eax,eax (2 clocks) ...

So the following messy but commented code looks like an accurate way to test simple instructions :alright:

btw: the debug macros in masm32v7 are excellent. Appreciate all the work that went into the package.
Posted on 2001-12-01 16:48:10 by grv575