greetz all

i was messing around with plain 32-bit protected mode code, using the pentium RDTSC instruction to measure the timing of a piece of code i had written. here is a fragment of my code which does a trivial measurement:

trial:
ret

start:
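db 0fh, 31h ;RDTSC
mov [t0_lo], eax ;store start counter, low dword (t0_lo/t0_hi are placeholder dword variables)
mov [t0_hi], edx ;store start counter, high dword

call trial

db 0fh, 31h ;RDTSC
mov [t1_lo], eax ;store end counter, low dword (t1_lo/t1_hi are placeholders too)
mov [t1_hi], edx ;store end counter, high dword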

this code gave me a difference of 10h (16 clock cycles), neglecting the reading instructions themselves. i ran this under plain old DOS using my own small 32-bit protected mode shell with all interrupts disabled, so i knew that this section of the code was the only thing executing (my idt was not initialized, so no question of any irq either).

now comes the weird part. i just made a small change to simulate trial being located a bit farther from the call site:

trial:
ret
align 4
db 8192 dup(0) ; roughly 2 pages of separation

start:
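db 0fh, 31h ;RDTSC
mov [t0_lo], eax ;store start counter, low dword (t0_lo/t0_hi are placeholder dword variables)
mov [t0_hi], edx ;store start counter, high dword

call trial

db 0fh, 31h ;RDTSC
mov [t1_lo], eax ;store end counter, low dword (t1_lo/t1_hi are placeholders too)
mov [t1_hi], edx ;store end counter, high dword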

and holy moly, my counter reading was 120h (288 clock cycles), again neglecting the reading instructions themselves. this is really weird. my question is:

"does the clock cycle count for the near call instruction in 32-bit mode depend on how far away the destination address of the call is?"

i know i am going wrong somewhere. i hope i am wrong :) for the sake of computing sanity. i would request anyone to please shed some light on this issue.
i am having the same problem with the far/near jump too.

awaiting a reply

best regards
jmpf00d
Posted on 2003-06-30 11:10:11 by jmpf00d
If the code is not in the cache then you are timing your memory. ;) You could confirm this by executing the code twice and only timing the second run. Also, don't forget to use a serializing instruction (CPUID) before RDTSC, so that earlier instructions have finished executing before the counter is read.
Posted on 2003-06-30 11:43:04 by bitRAKE
greetz

yep indeed, i knew i was doing something wrong. it was due to a branch prediction penalty.

i changed the code to this:

trial:
ret
inc eax ; 1 micro-op instruction, pairable in the U or V pipe
inc eax
inc eax
inc eax
db 8192 dup(90h) ; nops


that's it. guess what, i got the same 10h for my timing ;)

well, i would like to know more about how exactly you can place your code so that it falls in the cache of the CPU. i know you need to mess around with some jmps to bring instructions into the cache, but i would like to have a small tutorial on that if anyone can provide the same ;)

thnx a lot

best regards
jmpf00d
Posted on 2003-06-30 11:54:07 by jmpf00d
but i would like to have a small tutorial on that if anyone can provide the same


maybe you'll find something in Agner Fog's "How to optimize for the Pentium microprocessors".
It deals with some cache issues.
Hope you'll find what you're looking for :)

/edmund
Posted on 2003-06-30 17:13:47 by edmund
greetz edmund

thnx for the url, but i got that one already ;).

yeah, i understand now that the BTB and serialization play a major role in the timing of the instructions. i am experimenting now with various options, and will get back with anything weird i notice.

thnx once more.

best regards
jmpf00d
Posted on 2003-06-30 21:37:12 by jmpf00d
greetz all

well, i performed some tests and again i have some weird results which i wish to clarify.

i was measuring the cycle count for a far jmp in 32-bit protected mode. my construct was this:

db 0eah ; far jump opcode
dd offset mypack ; offset
dw code32_idx ; selector for CS
nop

mypack:

now i got a measure of around 20 clock cycles for this. however, when i changed it to this:

db 0eah ; far jump opcode
dd offset mypack ; offset
dw code32_idx ; selector for CS
nop
db 16384 dup(90h)
mypack:

just to simulate a far jmp to a location which is farther away, the clock cycles shot up to 1184.

again, this cycle count depended on the db xxxx dup(90h): if i changed it from 16384 to 8192, it dropped to about 850 clock cycles. i am really confused as to how an instruction's timing can differ depending on the destination address. can anyone throw some light on this?

i use CPUID to serialize, but i measure the worst-case performance in a single iteration. i want to measure the worst case, so i don't want the code to be in the cache :). and the above is what i got: varying clock cycles depending on the destination address of the far jmp.

awaiting a reply.

best regards
jmpf00d
Posted on 2003-06-30 22:44:30 by jmpf00d