greetz all
i was messing around with plain 32-bit protected mode code, using the pentium RDTSC instruction to determine the instruction timings of a piece of code i had written. here is a fragment of my code which does a trivial measurement:
trial:
ret
start:
db 0fh, 31h ;RDTSC
mov , eax ;read start counter
mov , edx
call trial
db 0fh, 31h ;RDTSC
mov , eax ;read end counter
mov , edx
this code gave me a difference of 10h (16 clock cycles) neglecting the reading instructions themselves. i did this code under plain old DOS using my own small 32-bit protected mode shell with all interrupts disabled so i knew that this section of the code was the only section executing (my idt was not initialized, so no question of any irq either).
now comes the weird part. i just made a small change to simulate trial being located a bit further away from the call location:
trial:
ret
align 4
db 8192 dup(0) ; roughly 2 pages of separation
start:
db 0fh, 31h ;RDTSC
mov , eax ;read start counter
mov , edx
call trial
db 0fh, 31h ;RDTSC
mov , eax ;read end counter
mov , edx
and holy moly, my counter reading was 120h (288 clock cycles), again neglecting the reading instructions themselves. this is really weird. my question is
"does the clock cycle count of the near call instruction in 32-bit mode depend on how far away the destination address of the call is?"
i know i am going wrong somewhere. i hope i am wrong :) for the sake of computing sanity. i would request anyone to plz shed some light on this issue.
i am having the same problem with the far/near jump too.
awaiting a reply
best regards
jmpf00d
If the code is not in the cache then you are timing your memory. ;) You could confirm this by executing the code twice and only timing the second run. Also, don't forget to use a serializing instruction (CPUID) before each RDTSC, so that all earlier instructions have completed before the counter is read.
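To make that suggestion concrete, here is a minimal sketch in C (GCC/Clang inline asm on x86, not the thread's raw protected-mode assembly). It runs the code once untimed so that it is already in the cache, then times the second run, with CPUID issued before each RDTSC. The names `serialized_rdtsc`, `time_first`, `time_warm` and `trial` are made up for illustration, and `trial` is only an empty stand-in for whatever you actually want to measure.

```c
#include <stdint.h>

/* CPUID followed by RDTSC. CPUID is a serializing instruction, so the
   instructions before it have completed when the counter is read. */
static inline uint64_t serialized_rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile(
        "xorl %%eax, %%eax\n\t"   /* CPUID leaf 0 */
        "cpuid\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        :
        : "ebx", "ecx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical stand-in for the routine being measured. */
static void trial(void)
{
    __asm__ volatile("" ::: "memory");
}

/* Times the very first call to fn; the code may not be in the cache yet. */
static uint64_t time_first(void (*fn)(void))
{
    uint64_t t0 = serialized_rdtsc();
    fn();
    uint64_t t1 = serialized_rdtsc();
    return t1 - t0;
}

/* Calls fn once untimed as a warm-up, then times the second call. */
static uint64_t time_warm(void (*fn)(void))
{
    fn();                          /* warm-up run, not measured */
    uint64_t t0 = serialized_rdtsc();
    fn();
    uint64_t t1 = serialized_rdtsc();
    return t1 - t0;
}
```

Calling `time_first(trial)` and then `time_warm(trial)` gives the two numbers to compare. Both include the cost of the second CPUID, so it helps to subtract the result for an empty routine before reading anything into the difference.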
greetz
yep indeed, i knew i was doing something wrong. it was due to a branch prediction penalty.
i changed the code to this ---
trial:
ret
inc eax ; 1 micro-op instruction uv paired
inc eax
inc eax
inc eax
db 8192 dup(90h) ; nops
that's it. guess what, i got the same 10h for my timing ;)
well, i would like to know more about how exactly u can place ur code so that it ends up in the CPU cache. i know u need to mess around with some jmps to bring instructions into the cache, but i would like to have a small tutorial on that if anyone can provide the same ;)
thnx a lot
best regards
jmpf00d
"but i would like to have a small tutorial on that if anyone can provide the same"
maybe you'll find something at Agner Fog's "How to optimize for the Pentium microprocessors".
It deals with some cache-issues.
Hope you'll find what you're lookin for :)
/edmund
greetz edmund
thnx for the url, but i got that one already ;).
yeah, i understand now that the BTB and serialization play a major role in the timing of the instructions. i am experimenting now with various options, and will get back with anything weird i notice.
thnx once more.
best regards
jmpf00d
greetz all
well, i performed some tests and again i have some weird outputs which i wish to clarify.
i was measuring the clock cycle count of a far jmp in 32-bit protected mode. my construct was this:
db 0eah ; far jump opcode
dd offset mypack ; offset
dw code32_idx ; selector for CS
nop
mypack:
now i got a measure of around 20 clock cycles for this. however when i changed it to this --
db 0eah ; far jump opcode
dd offset mypack ; offset
dw code32_idx ; selector for CS
nop
db 16384 dup(90h)
mypack:
just to simulate a far jmp to a location which is farther away. the clock cycles shot up to 1184.
again, the cycle count depended on the size of the db xxxx dup(90h): when i changed it from 16384 to 8192 it dropped to about 850 clock cycles. i am really confused as to how an instruction's timing can differ depending on the destination address. can anyone throw some light on this?
i use CPUID to serialize, but i measure the worst case performance in a single iteration. i want the worst case, so i don't want the code to be in the cache :). and the above is what i got: clock cycles that vary with the destination address of the far jmp.
awaiting a reply.
best regards
jmpf00d
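One thing that can be checked without running anything is how far apart the jump and its target end up in terms of cache lines and pages. Below is a rough sketch in C, assuming 32-byte cache lines (as on the Pentium), 4 KB pages, and that the far jmp itself sits at offset 0 (an assumption, since the post doesn't say where it is placed). `units_apart`, `LINE_BYTES` and `PAGE_BYTES` are hypothetical names used only here.

```c
#include <stddef.h>

enum {
    LINE_BYTES = 32,    /* assumed cache line size (Pentium) */
    PAGE_BYTES = 4096   /* assumed page size */
};

/* How many units (cache lines or pages) lie between the unit that
   contains offset a and the unit that contains offset b. */
static size_t units_apart(size_t a, size_t b, size_t unit)
{
    size_t ua = a / unit;
    size_t ub = b / unit;
    return ua > ub ? ua - ub : ub - ua;
}

/* Offset of mypack when the far jmp is at offset 0:
   7 bytes of far jmp (opcode + 32-bit offset + 16-bit selector),
   1 byte of nop, then the padding. */
static size_t target_offset(size_t padding)
{
    return 7 + 1 + padding;
}
```

With no padding, `units_apart(0, target_offset(0), LINE_BYTES)` is 0, so the target is in the same cache line as the jump. With 16384 bytes of padding the target is 512 lines and 4 pages away, and with 8192 bytes it is 256 lines and 2 pages away, so a single cold run has to fetch a new line for the target. This only shows the layout; it does not by itself account for the exact 1184 or 850 cycle figures.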