cpuid
rdtsc
mov timer, eax

nop
nop
nop

cpuid
rdtsc
sub eax, timer
ret


timer:
    dword 0

This takes the time of 3 nops.
I get a different EAX each time - why, what have I missed?
Posted on 2008-02-14 11:06:10 by sittingduck
Cache.
"mov timer,eax" could take 0.1 cycles, but also could take 500.

That's why there's a term "warming-up the caches" when benchmarking/profiling code.

Also, "cpuid" before "rdtsc" is not necessary.

Also, there's a quirk of paging: memory isn't actually physically allocated until you first access it.
So, either measure the timing of a loop (looping 100,000 - 100,000,000 times, you choose), or first make sure you've pre-accessed all of the necessary memory.
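For illustration, a loop-timing sketch in the same style as the code above (assumptions: 32-bit MASM-style syntax, a hypothetical "work" label for the code under test, and esi as the counter since cpuid clobbers eax, ebx, ecx and edx):

    call work            ; warm-up pass: populates caches and page mappings
    cpuid                ; serialize, so rdtsc isn't reordered
    rdtsc
    mov timer, eax       ; low 32 bits are enough for short intervals
    mov esi, 100000      ; iteration count - you choose
again:
    call work            ; work must preserve esi
    dec esi
    jnz again
    cpuid
    rdtsc
    sub eax, timer       ; eax = cycles for all 100,000 iterations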
Posted on 2008-02-14 14:38:22 by Ultrano

Also, "cpuid" before "rdtsc" is not necessary.


cpuid is a serializing instruction; it is necessary to prevent out-of-order execution of rdtsc on P6-series CPUs. Also, to stand less chance of a context switch during the test, you should be setting the thread priority...

invoke GetCurrentProcess
invoke SetPriorityClass, eax, REALTIME_PRIORITY_CLASS
invoke GetCurrentThread
invoke SetThreadPriority, eax, THREAD_PRIORITY_TIME_CRITICAL
Posted on 2008-02-14 18:44:06 by donkey
Also, things like Intel SpeedStep or AMD Cool'n'Quiet could be lowering your CPU frequency; you need to keep that in mind as well, and do a little CPU-intensive "warm-up" before profiling.

Also, set thread affinity to work around RDTSC bugs in AMD CPUs.

And only use rdtsc for profiling, never for timing in production code.
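For instance, pinning the thread could look like this (a sketch; mask 1 pins to CPU 0, but any single-CPU mask works):

    invoke GetCurrentThread
    invoke SetThreadAffinityMask, eax, 1   ; both rdtsc reads now hit the same core's TSC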
Posted on 2008-02-14 20:22:29 by f0dder
And only use rdtsc for profiling, never for timing in production code.

why? exactly because of the current variable megahurtz?

i think there is some code "out there" that uses the timestamp counter for timing...
would it be broken, since things are this way?

creepy  :shock:

so whats to use? win32 timers?

Posted on 2008-02-15 16:41:42 by HeLLoWorld
Yes, variable MHz, and unsynchronized TSC values (in dual-core). With the first problem you could incorrectly measure some proc as being slower than another (until the MHz kicks in), and make your app use the actually slower version. With the second problem, you can get a negative difference between time0 and time1 from RDTSC.

There was a discussion of this on VirtualDub's forums. Using the mm timer seems to be best practice (when timing audio and video streams), though it takes around 1000 cycles, as the 32768 Hz realtime clock is queried. I don't recall whether the problems were present on some laptops, thanks to awful hardware/BIOS. Supposedly, MS fixed it all with relevant OS updates (except for those laptops) - search for MSDN articles about it, too (I read it in the Knowledge Base section, IIRC).
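A sketch of the mm timer route (winmm functions; "start_ms" is a hypothetical dword variable):

    invoke timeBeginPeriod, 1     ; request 1 ms timer granularity
    invoke timeGetTime            ; milliseconds since boot, returned in eax
    mov start_ms, eax
    ; ... stream/code to time ...
    invoke timeGetTime
    sub eax, start_ms             ; elapsed milliseconds
    invoke timeEndPeriod, 1       ; restore previous granularity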

Btw, GetTickCount() simply returns a pre-cached value that is set by the thread scheduler when switching in response to the timer interrupt (16.6 ms granularity on my system, for instance).
Posted on 2008-02-15 17:02:55 by Ultrano

Yes, variable MHz, and unsynchronized TSC values (in dual-core). With the first problem you could incorrectly measure some proc as being slower than another (until the MHz kicks in), and make your app use the actually slower version. With the second problem, you can get a negative difference between time0 and time1 from RDTSC.

...and that is why all Unreal engine games crash on AMD64x2, bitching about negative time delta :)

Also, on the dualcore AMD machines, QueryPerformanceCounter seems to use RDTSC, at least it exhibits the same problems as using RDTSC.

AMD released a fix driver that periodically synchronizes the TSCs (are those writable through MSRs? how messed up is that? >_<), and called it a "processor optimization driver" instead of labelling it a bugfix...

On the Intel machines I've tested on (haven't tested my new quad-core box yet), QueryPerformanceCounter didn't seem like RDTSC timing, but more like a 1000 Hz accuracy timer. PIT? APIC/whatever timer?
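For comparison, the QueryPerformanceCounter route looks roughly like this ("t0", "t1" and "freq" are hypothetical qword variables; the final division is 64-bit):

    invoke QueryPerformanceFrequency, addr freq   ; counts per second
    invoke QueryPerformanceCounter, addr t0
    ; ... code to time ...
    invoke QueryPerformanceCounter, addr t1
    ; elapsed seconds = (t1 - t0) / freq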
Posted on 2008-02-15 18:32:19 by f0dder
oh the agony! PCs are not custom fixed-hardware consoles anymore... have they ever been?
and software lasts longer than hardware generations...

soon hardware will not be designed on efficiency grounds, but to best match the existing codebase... Core 2 is just this already.
Posted on 2008-02-15 18:59:07 by HeLLoWorld
Humm, I think Core 2 is more than just a "best codebase match" - it seems pretty darn nice overall, and the new SSE stuff it adds certainly isn't for existing codebases :)

But OK, if we look beyond x86 I think we could have much more efficient CPUs, but that just isn't going to happen, ever... x86-64 ruined that daydream :)
Posted on 2008-02-15 19:06:57 by f0dder
yes...i guess you're right...

as an aside, have you people heard about configware? i just stumbled upon it the other night on wikipedia and it blew my mind... so much potential...

edit: going to create a topic in the heap just for that. that's what i say.
Posted on 2008-02-15 20:14:41 by HeLLoWorld
So is it safe to say that real accurate timings cannot be done? :shock:
Posted on 2008-02-18 13:53:13 by sittingduck
Oh, it can be done really accurately - on my Sempron, single-core, no variable clock :D. And on systems like that.
Posted on 2008-02-18 14:51:18 by Ultrano

Oh, it can be done really accurately - on my Sempron, single-core, no variable clock :D. And on systems like that.

Unless you trigger SMM? ;)
Posted on 2008-02-18 17:55:43 by f0dder