All Pentium and later processors have an instruction RDTSC, which returns the number of clock cycles since last reset in EDX (high order 32 bits) and EAX (low order 32 bits). This may be how you can check "How long has this computer been on?" More importantly for our purposes, it is how we can check how long a series of instructions takes in clock cycles.

I enclose a simple program in HLA below. The main routine is written in MASM syntax and enclosed by #asm and #endasm. It executes RDTSC twice, once before and once after the instruction sequence to be measured.

The clock cycles consumed in one RDTSC will contribute towards the total clock cycles measured. On the Pentium MMX this appears to be 13 cycles. The 2 pushes take 1 cycle in parallel.

The program prints the number of cycles high order 32 bits first, then low order 32 bits.

Of interest: the largest 32-bit number, 2^32-1, in clock cycles is just over 1.4 seconds on a 3.0 GHz processor. On my 166 MHz Pentium MMX it is almost 26 seconds, an eternity in computer time.
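
The wrap-around arithmetic can be checked with a few lines of C. This is only a sketch: `seconds_to_wrap` is a hypothetical helper name, and the clock rates are simply the two quoted above.

```c
#include <assert.h>

/* Seconds until a cycle counter of the given width wraps at clock_hz.
   Hypothetical helper for illustration; bits must be in 1..64. */
static double seconds_to_wrap(unsigned bits, double clock_hz)
{
    assert(bits >= 1 && bits <= 64 && clock_hz > 0.0);
    double counts = 1.0;               /* becomes 2^bits */
    for (unsigned i = 0; i < bits; ++i)
        counts *= 2.0;
    return counts / clock_hz;
}
```

Called as `seconds_to_wrap(32, 3.0e9)` it gives about 1.43 seconds, and `seconds_to_wrap(32, 166.0e6)` gives about 25.9 seconds. With all 64 bits at 3.0 GHz the result is roughly 6.1e9 seconds, on the order of two centuries, which is why the high half in EDX matters for anything long-running.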

I use HLA because I understand the convention for the API calls I use (console mode functions provided by HLA - stdout, etc). I invite anyone to provide a conversion to MASM32 or NASM.

Thanks much.



// RDTSC Check Program by V Coder
// Written in HLA
//

program RDTSCCheck;
#include( "stdlib.hhf" );


begin RDTSCCheck;

console.cls();
console.gotoxy(4, 15);
stdout.put ( "RDTSC Check program.", nl nl);

// Main routine starts here

#asm
rdtsc ; First measure of time
; rdtsc takes 13 cycles on Pentium MMX,
push edx ; I store it on the stack.
push eax ; You can store the 64 bit number wherever you like

; routine to test
nop
nop
nop
nop
; end routine

rdtsc ; Second measure of time
sub eax, [esp] ; subtract first from second
sbb edx, [esp+4] ; result in EDX:EAX

; sub eax, 0eh ; Optional compensation for the rdtsc and 2 pushes
; sbb edx, 0 ; 14 cycles on a Pentium MMX
; 9 cycles on a K6-2

add esp,8 ; remove edx, eax from stack
#endasm

stdout.put ( edx," ",eax, " clocks.");
end RDTSCCheck;

Posted on 2003-04-28 22:29:58 by V Coder
Okay, can anyone please tell me how to retain spaces in program code when I post, so it does not just left justify everything?

Thanks.
Posted on 2003-04-28 22:34:56 by V Coder
V,

To retain the formatting in your examples, use the following format.


[xxx]
Your code
[/xxx]


Where "xxx" is the word "code".

Now with your comments on using RDTSC, I in fact agree with you that real-time testing is the most useful in algorithm design, even if you use other methods in the development of the code.

What you need to be aware of is the effects of running ordinary application code in what is called ring3 access. Privilege levels in Intel processors are used by the operating system to control what can be run with priority and what cannot.

When you run RDTSC in ring3 it is subject to interference from the operating system, which has priority over ring3 code, so you tend to get fluctuation in the results. My own testing shows 2-3%, which is enough to remove the advantage of the accuracy of RDTSC.

I usually use the simpler GetTickCount(), which only has millisecond resolution and suffers from the same fluctuation as RDTSC, but I run it over a large enough sample to get the error down to 1% or less.
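
As a rough sketch of why a larger sample shrinks the error, the granularity part can be written out in C. The function names are hypothetical, and the resolution is a parameter rather than a claim about the API; the effective granularity of GetTickCount is often coarser than 1 ms (typically 10 to 16 ms), so the real figure should be plugged in.

```c
#include <assert.h>

/* Worst-case relative error from timer granularity alone: a run lasting
   total_ms can be off by about one tick of resolution_ms. */
static double quantization_error(double resolution_ms, double total_ms)
{
    assert(resolution_ms > 0.0 && total_ms > 0.0);
    return resolution_ms / total_ms;
}

/* Shortest total run (in ms) that keeps that error at or below max_error. */
static double min_run_ms(double resolution_ms, double max_error)
{
    assert(resolution_ms > 0.0 && max_error > 0.0);
    return resolution_ms / max_error;
}
```

With a 1 ms tick, a 1% bound needs a run of about 100 ms; with a 16 ms tick the same bound needs about 1.6 seconds. This only covers granularity, not the ring3 fluctuation itself.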

Regards,

hutch@movsd.com
Posted on 2003-04-29 02:56:49 by hutch--
You'd still get more correct timings from rdtsc + "larger amount of iterations". It means you will have to deal with a little 64bit math since you can't just discard edx, but that's no major problem. Oh, it's possible to disable rdtsc access from ring3 programs - but all OSes I've seen let ring3 do rdtsc (and why shouldn't they?)
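
For reference, the "little 64bit math" amounts to the following, sketched here in C rather than assembly. `tsc_join` and `sub64` are hypothetical names; `sub64` mirrors what the SUB/SBB pair in the posted program does with the two 32-bit halves.

```c
#include <assert.h>
#include <stdint.h>

/* Combine the two 32-bit halves RDTSC returns (EDX = high, EAX = low). */
static uint64_t tsc_join(uint32_t edx, uint32_t eax)
{
    return ((uint64_t)edx << 32) | eax;
}

/* 64-bit subtraction the way SUB/SBB do it on a 32-bit CPU:
   subtract the low halves, then the high halves minus the borrow. */
static void sub64(uint32_t hi_a, uint32_t lo_a,
                  uint32_t hi_b, uint32_t lo_b,
                  uint32_t *hi_out, uint32_t *lo_out)
{
    assert(hi_out != 0 && lo_out != 0);
    uint32_t borrow = lo_a < lo_b;     /* carry flag after SUB */
    *lo_out = lo_a - lo_b;             /* SUB eax, [start_lo]  */
    *hi_out = hi_a - hi_b - borrow;    /* SBB edx, [start_hi]  */
}
```

If the low half wraps between the two readings (say from FFFFFFF0h to 00000010h while the high half goes from 1 to 2), discarding EDX would give nonsense, whereas the borrow-aware subtraction still yields 20h cycles.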
Posted on 2003-04-29 03:26:44 by f0dder
This does not solve the problem of fluctuation in ring3. 2-3% fluctuation renders the accuracy useless so there is no gain doing it at a ring3 level.

Increasing the size of the sample reduces the error percentage to whatever level you like, and GetTickCount is easier to use when the 64-bit resolution is already degraded by ring3 fluctuation.

Regards,

hutch@movsd.com
Posted on 2003-04-29 03:41:19 by hutch--
The key to using RDTSC is the setup (instruction/data cache and alignment). A large number of iterations is not needed. It is only effective for small (timing-wise) pieces of code that avoid the ring3 fluctuations. In some cases multiple samples can be taken and all but the minimum discarded. This minimum is the exact cycle count in most of my tests (Athlon).
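
The "keep only the minimum" step could look like this in C. It is a sketch, not anyone's actual test harness: `min_sample` is a hypothetical name, and the samples are assumed to be cycle counts already collected from repeated RDTSC measurements of the same code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Keep only the smallest of n timing samples. Larger values are assumed
   to include interrupts, cache misses or task switches. */
static uint64_t min_sample(const uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    uint64_t best = samples[0];
    for (size_t i = 1; i < n; ++i)
        if (samples[i] < best)
            best = samples[i];
    return best;
}
```

The reasoning is that outside interference can only add cycles, never remove them, so the smallest sample is the one closest to the undisturbed cost.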
Posted on 2003-04-29 03:50:50 by bitRAKE
Boost process + thread priority to realtime, and fluctuation shouldn't be that bad. Furthermore, rdtsc gives fine-grained results, so if your code isn't too long you can probably even avoid thread switching (since you don't need many iterations to be able to compute a result).

I'll have a look-see at all this when I implement rdtsc timing in the yodel bench. I expect it to work nicely.

Ah, rake beat me, and with a nicer phrasing.
Posted on 2003-04-29 03:59:29 by f0dder
Hutch,

Thanks much. I will start using the code tags from now on.

Rake,

I like that solution! I prefer RDTSC because it allows accurate measuring over a smaller number of iterations, so the testing can be done quickly. By repeating the tests, and discarding all but the minimum timing, the effect of the operating system ring3 access, etc is eliminated.

It seems that millisecond timing with ticks is necessary only for C++ programmers who do not have direct access to the processor RDTSC instruction.

Since assembly programming allows better, efficient use of processor resources, a million iterations of a routine just for timing with ticks is excessive and unnecessary. Plus it makes those with slower processors wait too long to get the same results as would be obtained with RDTSC over a smaller number of iterations.
Posted on 2003-04-29 08:27:44 by V Coder
Anyone

Please convert the above code to MASM format so I can get an idea of how to do so...

Thanks
Posted on 2003-04-29 08:34:03 by V Coder

V Coder wrote: "It seems that millisecond timing with ticks is necessary only for C++ programmers who do not have direct access to the processor RDTSC instruction."

I don't know of any modern C/C++ compiler that won't let you use inline asm or link to external asm routines - heck, even ancient compilers let you do that ^_^. Choosing GetTickCount instead of rdtsc, tjah... there isn't really much excuse.
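
A minimal sketch of what that looks like from C, assuming GCC or Clang extended inline asm on an x86 target. `read_tsc`, `time_once` and `routine_under_test` are hypothetical names, and the `clock()` fallback for other targets counts CPU time rather than cycles; it is there only so the file still builds elsewhere.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read the time-stamp counter from C via inline asm (GCC/Clang, x86 only). */
static uint64_t read_tsc(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();          /* fallback: not a cycle count */
#endif
}

/* Stand-in for the code being measured. */
static void routine_under_test(void)
{
    volatile int x = 0;
    x += 1;
}

/* Time one call of fn; the result still includes the RDTSC overhead. */
static uint64_t time_once(void (*fn)(void))
{
    assert(fn != NULL);
    uint64_t start = read_tsc();
    fn();
    return read_tsc() - start;
}
```

Note that RDTSC is not a serializing instruction, so on out-of-order processors the readings may be taken slightly earlier or later than the source order suggests.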
Posted on 2003-04-29 08:43:05 by f0dder


.586 ; rdtsc is not accepted under .386
.model flat, stdcall
option casemap:none

include /masm32/include/windows.inc
include /masm32/include/kernel32.inc
include /masm32/include/user32.inc
includelib /masm32/lib/kernel32.lib
includelib /masm32/lib/user32.lib
.data
format db "%u %u clocks",0 ; unsigned, high dword first
.data?
buffer db 64 dup (?)
hOutput dd ?
written dd ?

.code
start:
invoke GetStdHandle, STD_OUTPUT_HANDLE
mov hOutput, eax
rdtsc ; First measure of time
; rdtsc takes 13 cycles on Pentium MMX
push edx ; I store it on the stack.
push eax ; You can store the 64 bit number wherever you like

; routine to test
nop
nop
nop
nop
; end routine

rdtsc ; Second measure of time
sub eax, [esp] ; subtract first from second
sbb edx, [esp+4] ; result in EDX:EAX

; sub eax, 0eh ; Optional compensation for the rdtsc and 2 pushes
; sbb edx, 0 ; 14 cycles on a Pentium MMX
; 9 cycles on a K6-2

add esp,8 ; remove edx, eax from stack
invoke wsprintf, offset buffer, offset format, edx,eax ; returns the length in eax
invoke WriteConsole,hOutput, offset buffer, eax,offset written,0 ; write only the formatted characters
invoke ExitProcess, 0
end start

untested code
Posted on 2003-04-29 08:53:15 by roticv
Thanks much, roticv. I'll check it out.

(Yes it looks more painful than programming in HLA. But I suppose medicine is supposed to taste bad if it is to make you get better.)
Posted on 2003-04-29 09:00:02 by V Coder

V Coder wrote: "It seems that millisecond timing with ticks is necessary only for C++ programmers who do not have direct access to the processor RDTSC instruction."

Both methods are useful - some tests don't lend themselves well to cycle counts. IMO it is not a programming language choice so much as a choice induced by the CPU, OS and test code.
Posted on 2003-04-29 10:09:21 by bitRAKE
Question:
How does the operating system determine the number of ticks? Wouldn't the OS have to do calculations based on RDTSC as well?

The calculation, of course, would take a small amount of time... Plus, the cycle counts with RDTSC are accurate to a cycle, whereas the ticks are accurate to a millisecond (1 million cycles on a 1 GHz machine). With ticks, the only way to obtain similar accuracy as RDTSC is to use millions of iterations. Some tests actually use billions of iterations.

With ticks you have to measure the processor speed before you can determine the number of cycles.

With RDTSC, the processor speed is irrelevant. Way to go, RDTSC...
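
To make the dependence on processor speed concrete, here is a small C sketch of the tick-to-cycle conversion. The function names are hypothetical, and `clock_mhz` is exactly the value that has to be measured or known beforehand.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an elapsed time in milliseconds to clock cycles.
   1 MHz corresponds to 1000 cycles per millisecond. */
static uint64_t ms_to_cycles(uint64_t elapsed_ms, uint64_t clock_mhz)
{
    assert(clock_mhz > 0);
    return elapsed_ms * clock_mhz * 1000u;
}

/* Average cycles per iteration from a tick-based measurement. */
static double cycles_per_iteration(uint64_t elapsed_ms, uint64_t clock_mhz,
                                   uint64_t iterations)
{
    assert(iterations > 0);
    return (double)ms_to_cycles(elapsed_ms, clock_mhz) / (double)iterations;
}
```

For example, one million iterations taking 2000 ms on a 166 MHz machine work out to 332 cycles per iteration; the same tick count on a different clock would mean a different cycle count.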
Posted on 2003-04-29 10:21:32 by V Coder
humm... the PC architecture has an interrupt timer. I think most OSes would use this - constantly polling RDTSC would be slow.

Also, iirc the cycle counter reported by RDTSC can have problems especially on laptops with power saving modes; I don't think the interrupt timer would be affected by this.
Posted on 2003-04-29 10:26:23 by f0dder
He did some profiling work a while back.
http://www.asmcommunity.net/board/index.php?topic=7510&highlight=profile+code

Regards, P1
Posted on 2003-04-29 14:02:24 by Pone
yup, and I'm probably going to look into maverick's stuff, too. For my benchmarking, initially, I want to develop my own stuff to get a good feel for it all, though. And other issues, like linux compatibility, come before ultra-precise timing is implemented.

thanks for the link.
Posted on 2003-04-29 14:04:25 by f0dder
What is the interrupt timer, please? Is that the system interrupt that occurs 18.1 times per second to refresh system memory, a holdover from the mid 80s when memory would lose its data if not given a specific instruction to perform a refresh?
Posted on 2003-04-29 21:34:15 by V Coder
I dunno if the timer (8253 PIT) was used for refreshing memory, but if it was, I doubt it's doing that today ^_^. 18.1 Hz (or was it 18.2? I think so) sounds like the frequency the timer ran at under DOS - however, it can be programmed to around 1 MHz if I am not mistaken. Also, the PIT should be driven by external circuitry if I am not mistaken, and thus should not be affected by notebook power-saving stuff like rdtsc has been reported to be.
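
The numbers fall out of the PIT's input clock, which on PC-compatibles is 1,193,182 Hz, divided by a 16-bit reload value. A small C sketch of that division (`pit_rate_hz` is a hypothetical name, and this only does the arithmetic; it does not program the chip):

```c
#include <assert.h>

/* Input clock of the 8253/8254 PIT on PC-compatibles, in Hz. */
#define PIT_INPUT_HZ 1193182.0

/* Interrupt rate for a 16-bit reload value. A reload of 0 is treated
   as 65536, which is how the chip interprets it. */
static double pit_rate_hz(unsigned reload)
{
    assert(reload <= 65535u);
    double divisor = (reload == 0) ? 65536.0 : (double)reload;
    return PIT_INPUT_HZ / divisor;
}
```

The default divisor of 65536 gives about 18.2 Hz, a reload of 1193 gives roughly 1000 Hz, and the smallest divisor approaches the 1.19 MHz input clock, though interrupt rates anywhere near that would not be practical.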
Posted on 2003-04-30 02:27:18 by f0dder
f0dder wrote: "the PC architecture has an interrupt timer. I think most OSes would use this - constantly polling RDTSC would be slow."


I wonder... Wouldn't the OS routine that determines ticks only poll RDTSC when it is called, not constantly? On the other hand, the system interrupt timer, if it was set for 1000/sec would have to execute some code to count milliseconds, which would add thousands of clock cycles per second to run time.



hutch,

I agree that there is tremendous variability in the timings reported by RDTSC, especially for routines that should take, let's say, 200 cycles... I am getting wide variation... Still in the testing phase now, though... I hope to get a fix on the ideal number of iterations to mask that effect.
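
One way to see how wide the variation really is would be to summarize a batch of samples. The following C sketch is an assumption about how one might do that, not something from this thread: `struct spread` and `summarize` are hypothetical, and the samples are taken to be cycle counts from repeated runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct spread { uint64_t min, median, max; };

/* qsort comparison for unsigned 64-bit values. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sort the samples in place and report min, median and max.
   For an even count the lower middle value is used as the median. */
static struct spread summarize(uint64_t *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    qsort(samples, n, sizeof samples[0], cmp_u64);
    struct spread s = { samples[0], samples[(n - 1) / 2], samples[n - 1] };
    return s;
}
```

If the median sits close to the minimum while the maximum is far away, a handful of runs were disturbed and only a few samples are needed; if the median itself wanders, the setup (cache state, alignment) is probably the thing to look at.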



roticv,

I tried the code, but it gives an error on compile. I put in the "end", but I am not proficient in MASM to determine what else was cut off from the bottom.
Posted on 2003-04-30 07:21:02 by V Coder