Hi KetilO, you wrote:
Finally some mature response.
[..]
To all:
I started this thread hoping for a mature discussion leading to a result, but as often happens on this board someone has to show off and use 'war types' to heat up the discussion and thereby destroy the possibility of a result. It's a pity. I am here to learn and share my ideas.


Please don't take the "it was personal", victimistic road now.

The reason I got a bit pissed at your behaviour was that I wrote to you several times about some issues that you then completely ignored in your following posts. I'm sure you would have been annoyed by this behaviour too, if somebody asked about some RadAsm problem, you replied in detail, and in his next post he acted deaf. It is annoying, considering the very little free time I can dedicate to the board and to helping (something I gladly do, anyway).

That's the whole point: I have very little free time and I'd like to help, and thus to be listened to. Anyway I'm working on a "dramatic" solution for PROFILE (more about this later).

About "mature discussion", thanks for finally sharing which CPU was the one that gave you problems. The next release of PROFILE (which will be released as machine code, thus compatible with all assemblers, MASM included) will also be extremely careful about cache trace and other possible problems related to Windows and what not. I'm writing the next release of PROFILE in a "be paranoid" way, so as to cover all the improbable and unimaginable causes of problems, and to ensure MASM compatibility as well.

I will document the advances when I post the new routine, hoping the documentation will be read this time. ;)

Then please write a MASM sample application and report any inconvenience on the CPU's you have access to, possibly reporting the brand/model, etc.

And please don't take this as personal.. I only got heated from the 4th time I had to repeat the same things. Try to see it from the other point of view as well: I have nothing against you, never had, and don't want to have. I want to help but I don't want to waste my time.. simple as that.


ALL: Do you know of a single CPU where if CPUID reports the presence of CMOV one can NOT assume anyway the presence of SFENCE? The next release of PROFILE will do sfence (to improve consistency on some memory intensive routines), and I need a reliable way to detect its presence (should be on all CPU's that execute out of order, which should all also support CMOV. Do you know of any exception?).
Posted on 2002-08-14 08:07:52 by Maverick
Hi Maverick

Although not clearly stated, the request was targeted at MASM users who might have had a solution. I don't know why you chose to assume that I had not read your instructions or not fully understood the importance of alignment (after reading them).

KetilO
Posted on 2002-08-14 08:35:12 by KetilO
Hi KetilO,

The problem is wider than that.

For example, on the K7 a stall will happen if:



OR EAX,12
CMP EAX,128

<edit: the above alone will not produce the stall, so I better show a full test as well:>

CMP ECX,'g'-'0'
JAE .exit
MOV EAX,$12345678
CMP ECX,128 ; try 32 instead.. and the stall arises (10 cycles of stall!)
CMOVC EAX,EBX
BT EAX,ECX
JNC .exit


Simply because the former instruction's immediate operand can be encoded as a byte, while the latter's cannot. Weird, ain't it? This kind of little-known stall won't happen on Intel CPU's, AFAIK.

Many assemblers won't let you have control over the (short vs long) form of the instruction that will be produced, while others will, and e.g. FAsm will by default find/use the shortest possible form, but will let you specify what you prefer, in case.
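To make the short vs long form distinction concrete, here is a minimal C sketch (my own illustration, not part of PROFILE) that tells which immediates can use the sign-extended byte form. The function names are hypothetical, and the lengths assume the generic register encodings `83 /7 ib` and `81 /7 id` of CMP; EAX also has a 5-byte `3D id` form, which is ignored here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when a 32-bit immediate survives sign-extension from 8 bits,
   so the assembler may pick the short "83h + imm8" form. */
bool fits_imm8(int32_t imm)
{
    return imm >= -128 && imm <= 127;
}

/* Encoded length in bytes of "CMP r32, imm" in the generic register form:
   83 /7 ib (3 bytes) when short, 81 /7 id (6 bytes) when long. */
int cmp_r32_imm_length(int32_t imm)
{
    return fits_imm8(imm) ? 3 : 6;
}
```

With the values used in this thread, 12, 32, 124 and 'g'-'0' (55) fit the short form, while 128 and 124333 need the long form; 128 fails only because the byte is sign-extended.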

PROFILE is extremely tricky in this regard.. if an instruction is assembled a bit differently than I thought, the results may be much different. These things aren't obvious, by any means. That's why I insisted on the importance of not assuming anything (about alignment or anything else), although a small difference may seem not to change anything (while indeed it may).

That's why I decided, for the next release, to make a machine code version.. so as to provide assembler-independent code, and be sure all works as expected. Also, I'm glad if MASM users can use it, although I don't like that assembler.

I will finish and release it when I have my next free time slice.
Posted on 2002-08-14 08:45:19 by Maverick
Hi Maverick

Thanks, :alright:

It is a great tool, still a bit tricky to use under MASM tho.

KetilO
Posted on 2002-08-14 08:50:54 by KetilO
You're welcome.. but, believe me, MASM has its limitations; it's not that I'm not supporting it. MASM is good as a HLL tool, but when you want to do really low level stuff, it has many limitations.

I'm trying to circumvent all of them anyway, and the next release of PROFILE should not be too difficult to use with any assembler, although being the tricky tool that it is, you're right, it's not too trivial to use, unfortunately. Nothing can be done to improve this, though, I'm afraid. It's a tricky thing to count the exact number of cycles an arbitrary routine takes on a modern CPU.
Posted on 2002-08-14 09:10:34 by Maverick

ALL: Do you know of a single CPU where if CPUID reports the presence of CMOV one can NOT assume anyway the presence of SFENCE? The next release of PROFILE will do sfence (to improve consistency on some memory intensive routines), and I need a reliable way to detect its presence (should be on all CPU's that execute out of order, which should all also support CMOV. Do you know of any exception?).

The SSE instruction set includes SFENCE. You need to check the "SSE Extensions" flag of CPUID.

About your stall example: all stall situations are described in the CPU documentation (AMD/Intel). I don't have the K7 documentation, but your example looks strange. Maybe such a stall is specific to the K7 only? I'd be very interested to know where you learned about this.
Posted on 2002-08-14 11:34:06 by Nexo
I don't have my docs here.. but doesn't the Pentium II support SFENCE?
Are you 100% sure that it is a SSE extension? What about Athlons then.. no SSE (but probably XMMX, i.e. MMX SSE extensions), but SFENCE support. A 100% sure, final word on this would be appreciated.

About the weird stall, I came to it by experience, but I've explicitly read about this somewhere too (K6/K7).

Play with my example.. you'll see it's a short vs long form stall. The 1st and 2nd CMP produce the stall in a "xor" fashion.. both short = no stall, both long = no stall, otherwise stall.
Posted on 2002-08-14 12:59:09 by Maverick
On the Athlon, SFENCE is included in the MMX Extensions (22007.pdf, page 293). Detect it by testing bit 22 of the extended feature flags (20734.pdf, page 13).
I played with the example. Every stall must have a cause. But is this stall really caused by long/short arguments?
I started from this code:


...
PROFILE Test1
...
align
Test1:
mov ecx,1
CMP ECX,124
JAE exit
MOV EAX,123
CMP ECX,124333
CMOVC EAX,EBX
BT EAX,ECX
JNC exit
exit: ret

Clocks=3
Then I added one NOP after Test1:
Clocks=32
Then I added one more NOP:
Clocks=5
..31,5,32,6,33,6...
Hmm.. What can you say? Is it only with my CPU (AthlonXP)? Could anyone check this, please?
Posted on 2002-08-14 15:33:32 by Nexo

I don't have my docs here.. but doesn't the Pentium II support SFENCE?


Pentium III is the first Intel processor that has sfence. Before Pentium III, Intel did not need sfence at all, because Intel's weakly ordered memory write became available (at least, to asm programmers) only with Pentium III.
Posted on 2002-08-14 17:54:50 by Starless

What can you say? Is it only with my CPU (AthlonXP)? Could anyone check this, please?
Oh, yes we are not alone. :) Maybe, now you can understand some of my comments in previous test cases and discussions with Svin. These are very funny processors and we merely probe their inner workings with such code, imho. Here are my results:
	;Total time (Athlon TB):

; 2 or 3 (no nop's)
nop ; 3 or 31 (one nop)
nop ; 4 (...etc.)
nop ; 4
nop ; 6
nop ; 6 or 34
nop ; 4
nop ; 4 or 33
nop ; 6 or 10
nop ; 5 or 34
nop ; 6 or 8
nop ; 7 or 35
mov ecx,1
CMP ECX,124
JAE exit
MOV EAX,123
CMP ECX,124333
CMOVC EAX,EBX
BT EAX,ECX
JNC exit
exit: ret
These are very funny results, but highly reproducible! :grin:

Further tests would need to be made to exclude PROFILE being the source of error, but I do not have the time right now.
Posted on 2002-08-14 21:52:45 by bitRAKE
bitRAKE, I know the cause of this stall :) Remember the following rule and you'll save several clocks ;)


jnc exit
nop ; K7 branch predict doesn't like RET after Jcc
exit:
ret

But Maverick's description of the stalls still looks strange to me. Maybe it is something else. The last example is unfit for researching this. Maverick, can you make another example?
Posted on 2002-08-15 10:48:21 by Nexo
Nexo, that does not change much. New results:
nop	; 7 or 11

nop ; 8 or 21
nop ; 5
nop ; 6 or 20
nop ; 6
nop ; 5 or 18
nop ; 6
nop ; 6 or 20
nop ; 3 or 4
nop ; 4
nop ; 3
; 3 or 15
mov ecx,1
CMP ECX,124
JAE exit
MOV EAX,123
CMP ECX,124333
CMOVC EAX,EBX
BT EAX,ECX
JNC exit
nop ; K7 branch predict doesn't like RET after Jcc
exit: ret
Posted on 2002-08-15 22:40:35 by bitRAKE
Hi Nexo: the stalls changing with NOPs to me look like even/odd alignment issues that cause problems to the instruction decoders.

About the short vs long stall instead, the following is from an older AMD optimization document (21924):


Avoid superset dependencies - Using the larger form of
a register immediate after an instruction uses the smaller form
creates a superset dependency and prevents parallel execution.
For example, avoid the following type of code:
OR AH,07h
ADD EAX,1555555h
One method for avoiding superset dependencies is to schedule
the instruction with the superset dependency (for example, the
ADD instruction) 4-6 instructions later than would normally be
preferable. Another method, useful in some cases, is to use the
MOVZX instruction to efficiently convert a byte-size value to a
doubleword-size value, which can then be combined with other
values in 32-bit operations.


Although this should probably apply to the K6 only, I found that it (maybe in minor part, but still) applies to the K7 as well.

I also found some interesting branch cache misbehaviours with particular alignments, which in the very little time I had to investigate the problem, looked to me like some sort of undocumented instruction cache bank conflicts.

Anyway, I'm extremely busy with my job right now.. I will have to delay these experiments and the release of the new PROFILEr by some days, but it's rock solid.. just as an example, here's an early excerpt from the new docs:

<<

This is the new, official, improved version of my PROFILE tool. It is now assembler-independent, FPU aware, and has several other small improvements. For example, under a preemptive multitasking OS such as Windows, our process may be switched out in the middle of the profiling job. This means that the profiler will return a wrong value, because of the OS interference.
Although this rarely happens, we have to take the possibility into account, to offer as much reliability as possible, even without a human/intelligent interpretation of the results. The new PROFILEr automatically detects and fixes abnormal results/situations, so you can always be sure that what you get is what the CPU really spends on your test routine.

For example, this shows the consistency and precision of the new PROFILEr:


Test Code        CPU:  Pentium   Athlon
just a RET:            0         0  cycles
 1 NOP + RET:          1         1  cycles
 2 NOP + RET:          1         1  cycles
 3 NOP + RET:          2         1  cycles
 4 NOP + RET:          2         2  cycles
 5 NOP + RET:          3         2  cycles
 6 NOP + RET:          3         2  cycles
 7 NOP + RET:          4         3  cycles
 8 NOP + RET:          4         3  cycles
 9 NOP + RET:          5         3  cycles
10 NOP + RET:          5         4  cycles
11 NOP + RET:          6         4  cycles
12 NOP + RET:          6         4  cycles


NOTE: for reasons I don't have the time to dig into, the new alignment rule is a page (4 KB), and it must be followed.

>>
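As a small illustration of what the cache-line and page alignment requirements mean numerically, here is a hedged C sketch. The helper names are hypothetical and this is not how PROFILE itself enforces alignment; it only shows the arithmetic an assembler's align directive has to perform.

```c
#include <stdint.h>

/* Bytes of padding needed so that `addr` becomes a multiple of `align`.
   `align` must be a power of two (64 for a cache line, 4096 for a page). */
uint32_t padding_for(uint32_t addr, uint32_t align)
{
    return (align - (addr & (align - 1))) & (align - 1);
}

/* Two addresses share a 4 KB page when they agree above the low 12 bits. */
int same_page(uint32_t a, uint32_t b)
{
    return (a >> 12) == (b >> 12);
}
```

For example, an address already on a page boundary needs no padding, while one byte past it needs 4095 bytes to reach the next page.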

PS: so, which CPU's support SFENCE?

Intel: PIII and PIV, some Celerons (best detection: SSE bit?)
AMD: K7 and Duron only? If not, best detection method = ?)
Posted on 2002-08-16 03:52:02 by Maverick
Hi bitRAKE. I always get the same clocks for each NOP count: 3,4,4,3,6,6,5,6,6,5,9,7... No jumping clocks here.
I found the source of the Jcc/NOP/RET stalls: AMD Library Reference [ftp://ftp.amd.com/pub/devconn/sdk/library.zip] - useful thing ;)
How do you calculate clocks, with PROFILE or something else? I use different tools and similar results appear.
Posted on 2002-08-16 12:24:17 by Nexo

Hi Nexo: the stalls changing with NOPs to me look like even/odd alignment issues that cause problems to the instruction decoders.
Maverick, I simplified the last example:


clc
jnc exit
;nop
exit: ret
and get the same result. All these instructions are decoded by the DirectPath decoders, three per cycle. The fetcher reads 16 bytes and can decode 24 bytes in all (no alignment problem here). This last code is less than 16 bytes and can be decoded easily.
About the short vs long stall instead, the following is from an older AMD optimization document (21924):
...
Although this should probably apply to the K6 only, I found that it (maybe in minor part, but still) applies to the K7 as well.
Yes. It is also a problem on the Pentium Pro through Pentium III processors (but not on the Pentium 4). It is described in the docs from Intel (24896602.pdf - Pentium 4) & AMD (22007.pdf - Athlon).
PS: so, which CPU's support SFENCE?

Intel: PIII and PIV, some Celerons (best detection: SSE bit?)
AMD: K7 and Duron only? If not, best detection method = ?)

if (Intel) SFENCE=CPUID(1).edx[25]
if (AMD) SFENCE=CPUID(80000001h).edx[22]
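The two rules above can be sketched in C as a pure function over the raw CPUID results. This is only an illustration: the names are hypothetical, the caller is assumed to have run CPUID already, and accepting the SSE bit on any vendor is my own addition on top of the rule above (SFENCE is part of SSE).

```c
#include <stdbool.h>
#include <stdint.h>

enum cpu_vendor { VENDOR_INTEL, VENDOR_AMD };

/* std_edx: EDX returned by CPUID with EAX = 1
   ext_edx: EDX returned by CPUID with EAX = 80000001h (pass 0 if unsupported) */
bool has_sfence(enum cpu_vendor vendor, uint32_t std_edx, uint32_t ext_edx)
{
    const uint32_t SSE_BIT = 1u << 25;        /* standard feature flags */
    const uint32_t AMD_MMXEXT_BIT = 1u << 22; /* AMD extended feature flags */

    if (std_edx & SSE_BIT)                    /* SSE includes SFENCE */
        return true;
    if (vendor == VENDOR_AMD)                 /* bit 22 is only meaningful on AMD */
        return (ext_edx & AMD_MMXEXT_BIT) != 0;
    return false;
}
```

Keeping the CPUID call outside the function makes the bit tests easy to check on any machine, whatever CPU it actually has.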
Posted on 2002-08-16 12:24:18 by Nexo

How do you calculate clocks, with PROFILE or something else? I use different tools and similar results appear.
I use PROFILE above so we have comparable results, but I have my own profiling macros, too.
Posted on 2002-08-17 23:20:08 by bitRAKE
Maverick,
I wonder if it's possible to design your PROFILE as
startPROFILE
endPROFILE
to measure code between two freely chosen points.
I still need such testing (if not to say "mostly need").
And there are many occasions when I couldn't test code as a piece with a ret at the end.
For now, I use PROFILE for testing snippets and procs, and use TimeTest_ON(OFF) for the
described cases. I'd prefer PROFILE to do it. What do you say?
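A start/end pair like the one requested could look roughly like this in C. It is a hypothetical sketch, not Maverick's design: the counter readings are passed in by the caller, and obtaining them accurately is the hard part that is not shown.

```c
#include <stdint.h>

/* Hypothetical region timer: the caller supplies raw counter readings. */
struct region_timer {
    uint64_t overhead;  /* cycles measured for an empty start/end pair */
    uint64_t start;     /* counter value at the start marker */
};

void region_start(struct region_timer *t, uint64_t now)
{
    t->start = now;
}

/* Cycles between the markers, minus the calibrated overhead (never negative). */
uint64_t region_end(const struct region_timer *t, uint64_t now)
{
    uint64_t raw = now - t->start;
    return raw > t->overhead ? raw - t->overhead : 0;
}
```

Clamping at zero avoids reporting a huge unsigned value when a region measures shorter than the calibrated overhead.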
Posted on 2002-08-18 13:33:17 by The Svin
Hi The Svin,
I'll work on it.. although I can't guarantee anything because some months ago I investigated this interesting possibility, but the results weren't satisfactory. Modern (esp. K7 & P4) CPU's are extremely tricky, especially if one has to do it in ring 3 (in ring 0 we'd have additional advantages like reading the TSC with a serializing instruction like RDMSR, while RDTSC is not serializing). I'll try again with more attention now. I have to take some more days, also because I have a lot of work to do at my job, and some relatives at home for the holidays.
Posted on 2002-08-19 04:15:22 by Maverick
Hi Maverick!

Your new version of the PROFILER is just great!
I was thinking about that same thing, making an executable out of the FASM MACRO, and my idea was to use CreateProcess with the REALTIME_PRIORITY_CLASS flag on.
That would probably work, because it starts a whole new page for the process, and then quickly returns control.
What do you think?



;
; Maverick's PROFILER executable
;

format PE GUI 4.0
entry start

include 'include\kernel.inc'
include 'include\user.inc'
include 'include\comdlg.inc'

include 'include\macro\stdcall.inc'
include 'include\macro\import.inc'


macro align value { rb (value-1) - ($ + value-1) mod value }


section '.data' data readable writeable

ofn OPENFILENAME

filter db 'Executable files',0,'*.EXE',0
db 'All files',0,'*.*',0
db 0

file_title rb 100h

_message db 'Cycles: '
.number db '0000000000',0
_caption db "Maverick's PROFILER",0

section '.code' code readable executable

start:

invoke GetModuleHandle,0
mov [ofn.hInstance],eax
mov [ofn.lStructSize],ofn.size
mov eax,NULL
mov [ofn.hwndOwner],eax
mov [ofn.lpstrFilter],filter
mov [ofn.lpstrCustomFilter],NULL
mov [ofn.nFilterIndex],1
mov [ofn.lpstrFileTitle],file_title
mov [ofn.nMaxFileTitle],100h
mov [ofn.lpstrInitialDir],NULL


invoke VirtualAlloc,0,1000h,MEM_COMMIT,PAGE_READWRITE
mov esi,eax
mov [ofn.lpstrFile],esi
mov [ofn.nMaxFile],1000h
mov byte [esi],0
mov [ofn.Flags],OFN_EXPLORER+OFN_FILEMUSTEXIST+OFN_HIDEREADONLY
invoke GetOpenFileName,ofn
or eax,eax
jz finish

mov dword [_PROFILE.ROUTINE],Address
call _PROFILE

;------------------------------------------------------------------------------
align 64

PROFILE.CYCLES: DD 0
DD 0
_PROFILE.EMPTY: DD 0 ; how many cycles it takes for a simple RET to be executed on the host CPU
DD 0
_PROFILE.ROUTINE: DD 0
_PROFILE.RETURN: DD 0
_PROFILE.IN.EAX: DD 0
_PROFILE.IN.EBX: DD 0
_PROFILE.IN.ECX: DD 0
_PROFILE.IN.EDX: DD 0
_PROFILE.IN.ESI: DD 0
_PROFILE.IN.EDI: DD 0
_PROFILE.IN.EBP: DD 0
_PROFILE.IN.EFL: DD 0
_PROFILE.OUT.EAX: DD 0
_PROFILE.OUT.EBX: DD 0
_PROFILE.OUT.ECX: DD 0
_PROFILE.OUT.EDX: DD 0
_PROFILE.OUT.EFL: DD 0
_PROFILE.RETADDR: DD 0

startupinfo STARTUPINFO
processinfo PROCESSINFO

; the following is to make sure that data and code are on a different page.
align 4096 ; note: *YOU* have to provide alignment
;------------------------------------------------------------------------------

align 64 ; align to a cache entry on all CPU's
_PROFILE:
MOV DWORD [_PROFILE.IN.EAX],EAX ; saves INPUT EAX (will be trashed by CPUID)
MOV DWORD [_PROFILE.IN.EBX],EBX ; saves INPUT EBX (will be trashed by CPUID)
MOV DWORD [_PROFILE.IN.ECX],ECX ; saves INPUT ECX (will be trashed by CPUID)
MOV DWORD [_PROFILE.IN.EDX],EDX ; saves INPUT EDX (will be trashed by CPUID)
MOV DWORD [_PROFILE.IN.ESI],ESI ; saves INPUT ESI (will be trashed by the routine to be tested, which will be called multiple times)
MOV DWORD [_PROFILE.IN.EDI],EDI ; saves INPUT EDI (will be trashed by the routine to be tested, which will be called multiple times)
MOV DWORD [_PROFILE.IN.EBP],EBP ; saves INPUT EBP (will be trashed by the routine to be tested, which will be called multiple times)
PUSHFD
POP DWORD [_PROFILE.IN.EFL] ; saves INPUT CPU EFLAGS
POP DWORD [_PROFILE.RETURN] ; saves return address
PUSH DWORD [_PROFILE.ROUTINE] ; saves requested _PROFILE.ROUTINE
MOV DWORD [_PROFILE.ROUTINE],.empty ; first we'll profile a simple RET
MOV DWORD [_PROFILE.RETADDR],.ret1
JMP DWORD .profile ; make sure it gets cached
.ret1: MOV DWORD [_PROFILE.RETADDR],.ret2
JMP DWORD .profile ; profile for real (well, let it set up)
.ret2: MOV DWORD [_PROFILE.RETADDR],.ret3
JMP DWORD .profile ; profile for real (well, let it set up again)
.ret3: MOV DWORD [_PROFILE.RETADDR],.ret4
JMP DWORD .profile ; profile for real (well, let it set up one final time)
.ret4: MOV DWORD [_PROFILE.RETADDR],.ret5
JMP DWORD .profile ; profile for real
.ret5: MOV EAX,DWORD [PROFILE.CYCLES+0]
MOV EDX,DWORD [PROFILE.CYCLES+4]
MOV DWORD [_PROFILE.EMPTY+0],EAX ; saves RET cycles count
MOV DWORD [_PROFILE.EMPTY+4],EDX
;
POP DWORD [_PROFILE.ROUTINE] ; restores requested _PROFILE.ROUTINE
MOV DWORD [_PROFILE.RETADDR],.ret6
JMP BYTE .profile ; make sure it gets cached
.ret6: MOV DWORD [_PROFILE.RETADDR],.ret7
JMP BYTE .profile ; profile for real
.ret7: MOV DWORD [_PROFILE.RETADDR],.ret8
JMP BYTE .profile ; make sure it gets cached
.ret8: MOV DWORD [_PROFILE.RETADDR],.ret9
JMP BYTE .profile ; profile for real
.ret9: MOV DWORD [_PROFILE.RETADDR],.ret10
JMP BYTE .profile ; make sure it gets cached
.ret10: MOV EAX,DWORD [_PROFILE.EMPTY+0] ; subtracts simple RET overhead
MOV EDX,DWORD [_PROFILE.EMPTY+4]
SUB DWORD [PROFILE.CYCLES+0],EAX ; saves cycles, low 32bit
SBB DWORD [PROFILE.CYCLES+4],EDX ; saves cycles, high 32bit
MOV EAX,DWORD [_PROFILE.OUT.EAX] ; gives OUTPUT EAX
MOV EBX,DWORD [_PROFILE.OUT.EBX] ; gives OUTPUT EBX
MOV ECX,DWORD [_PROFILE.OUT.ECX] ; gives OUTPUT ECX
MOV EDX,DWORD [_PROFILE.OUT.EDX] ; gives OUTPUT EDX
PUSH DWORD [_PROFILE.OUT.EFL]
POPFD ; gives CPU EFLAGS
JMP DWORD [_PROFILE.RETURN] ; returns to caller
.profile:
MOV EAX,DWORD [PROFILE.CYCLES+0] ; touches caches
MOV EDX,DWORD [PROFILE.CYCLES+4]
MOV EAX,DWORD [_PROFILE.IN.EAX]
MOV EBX,DWORD [_PROFILE.IN.EBX]
MOV ECX,DWORD [_PROFILE.IN.ECX]
MOV EDX,DWORD [_PROFILE.IN.EDX]
MOV ESI,DWORD [_PROFILE.IN.ESI]
MOV EDI,DWORD [_PROFILE.IN.EDI]
MOV EBP,DWORD [_PROFILE.IN.EBP]
MOV EAX,DWORD [_PROFILE.IN.EFL]
MOV EAX,DWORD [_PROFILE.OUT.EAX]
MOV EBX,DWORD [_PROFILE.OUT.EBX]
MOV ECX,DWORD [_PROFILE.OUT.ECX]
MOV EDX,DWORD [_PROFILE.OUT.EDX]
MOV EAX,DWORD [_PROFILE.OUT.EFL]
MOV EAX,DWORD [_PROFILE.ROUTINE]
MOV ECX,32
.stack: PUSH EAX ; touches stack
LOOP .stack
ADD esp,128
XOR EAX,EAX
CPUID ; flush pipelines
RDTSC
MOV DWORD [PROFILE.CYCLES+0],EAX ; saves TSC, low 32bit
MOV DWORD [PROFILE.CYCLES+4],EDX ; saves TSC, high 32bit
XOR EAX,EAX
CPUID ; flush pipelines
MOV EAX,DWORD [_PROFILE.IN.EAX] ; restores INPUT EAX
MOV EBX,DWORD [_PROFILE.IN.EBX] ; restores INPUT EBX
MOV ECX,DWORD [_PROFILE.IN.ECX] ; restores INPUT ECX
MOV EDX,DWORD [_PROFILE.IN.EDX] ; restores INPUT EDX
MOV ESI,DWORD [_PROFILE.IN.ESI]
MOV EDI,DWORD [_PROFILE.IN.EDI]
MOV EBP,DWORD [_PROFILE.IN.EBP]
PUSH DWORD [_PROFILE.IN.EFL]
POPFD ; restores CPU EFLAGS
CALL DWORD [_PROFILE.ROUTINE] ; calls the routine to be tested
MOV DWORD [_PROFILE.OUT.EAX],EAX ; saves OUTPUT EAX
MOV DWORD [_PROFILE.OUT.EBX],EBX ; saves OUTPUT EBX
MOV DWORD [_PROFILE.OUT.ECX],ECX ; saves OUTPUT ECX
MOV DWORD [_PROFILE.OUT.EDX],EDX ; saves OUTPUT EDX
PUSHFD
POP DWORD [_PROFILE.OUT.EFL] ; saves OUTPUT CPU EFLAGS
XOR EAX,EAX
CPUID ; flush pipelines
RDTSC
XCHG DWORD [PROFILE.CYCLES+0],EAX
XCHG DWORD [PROFILE.CYCLES+4],EDX
SUB DWORD [PROFILE.CYCLES+0],EAX ; saves TSC, low 32bit
SBB DWORD [PROFILE.CYCLES+4],EDX ; saves TSC, high 32bit
JMP DWORD [_PROFILE.RETADDR]

align 64 ; align to a cache entry on all CPU's
.empty: RET

; ---------------------------------------------------------------------------

Address:
mov eax,[startupinfo.size]
mov [startupinfo.cb],eax
invoke CreateProcess,file_title,0,0,0,0,REALTIME_PRIORITY_CLASS,0,0,startupinfo,processinfo
;To be very accurate we should also subtract CreateProcess' prologue and those two movs.
;I'll implement that later ;)
invoke TerminateProcess,[processinfo.hProcess],0 ; pass the handle value, not the address of the field

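mov eax,[PROFILE.CYCLES] ; low 32 bits of the cycle count
mov edi,_message.number
mov ecx,10 ; divisor
add edi,9 ; last digit of the 10-digit field (offset 10 holds the terminating 0)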
convert:
xor edx,edx
div ecx
add dl,30h
mov [edi],dl
dec edi
cmp eax,0
jne convert

invoke MessageBox,0,_message,_caption,MB_OK

finish:
invoke ExitProcess,0

section '.idata' import data readable writeable

library kernel,'KERNEL32.DLL',\
user,'USER32.DLL',\
com,'COMDLG32.DLL'

kernel:
import GetModuleHandle,'GetModuleHandleA',\
VirtualAlloc,'VirtualAlloc',\
CreateProcess,'CreateProcessA',\
TerminateProcess,'TerminateProcess',\
ExitProcess,'ExitProcess'

user:
import MessageBox,'MessageBoxA'

com:
import GetOpenFileName,'GetOpenFileNameA'



Right now, it doesn't really work, but I'm still working on it.
I'm using your profiler unchanged, because I don't know exactly how it works. I get lost in
CPUID

RDTSC
Can you explain it a little?


And if anybody sees what's wrong with the code, please feel free...
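One thing worth checking in the listing is the digit-conversion loop at the end. Here is a minimal C sketch of the same idea, assuming a 10-digit zero-padded field followed by a terminating 0, as in `_message.number`; the function name is hypothetical. The point to notice is that writing starts at index 9, the last digit, and not at index 10, which holds the terminator.

```c
#include <stdint.h>

/* Writes `value` right-aligned into a 10-character, zero-padded field.
   `field` must hold at least 11 bytes (10 digits plus the terminating 0). */
void cycles_to_decimal(uint32_t value, char field[11])
{
    for (int i = 0; i < 10; i++)
        field[i] = '0';
    field[10] = '\0';

    int pos = 9;                    /* last digit, not the terminator */
    do {
        field[pos--] = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
}
```

A 32-bit value has at most 10 decimal digits, so the loop can never write before the start of the field.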
Posted on 2002-08-19 11:34:24 by slop
Hi sloppy, you wrote: Your new version of the PROFILER is just great!

Hardly; I haven't released it yet. ;P

The FASM code I released some days ago is the old NASM one simply rewritten to fit FASM syntax.

The new PROFILE (unreleased yet) is a major rewrite.

I was thinking about that same thing, making an executable out of the FASM MACRO, and my idea was to use CreateProcess with the REALTIME_PRIORITY_CLASS flag on.
That would probably work, because it starts a whole new page for the process, and then quickly returns control.
What do you think?


REALTIME_PRIORITY_CLASS would help, but it's not necessary on ~small routines. In the case that the task is switched away, the (next) PROFILE won't go in error, though, because it selects the most accurate result from 16.. which anyway are almost always the same.. so you may wonder why 16 and not 2 or 3.. well.. reason: let's be paranoid :grin: . Seriously, it has to do with branch prediction misbehaviours, FPU, etc.. 8 would suffice, but I've chosen 16 with future CPU's in mind (or those I have no access to, like the Pentium IV).
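How the next PROFILE picks "the most accurate" of its 16 results is not described here, so the following C sketch is only an assumption: it takes the smallest sample, on the reasoning that a task switch or a cache miss can only add cycles to a run. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the smallest sample; `count` must be at least 1. */
uint64_t best_of(const uint64_t *samples, size_t count)
{
    uint64_t best = samples[0];
    for (size_t i = 1; i < count; i++)
        if (samples[i] < best)
            best = samples[i];
    return best;
}
```

With this rule a single run disturbed by the OS does not affect the reported figure, as long as at least one run was undisturbed.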

Sorry, I'm not even at home anymore (nor at my gf's home).. I apologize, but there are still some days to wait for the new PROFILE. You surely all have much more interesting things to do in the meantime, though.

PS: about CPUID and RDTSC:

CPUID is used here not because it tells you which CPU it's running on, but because it's a serializing instruction (check the Intel manuals for the details), and in essence helps give accurate results (together with other solutions).

RDTSC reads an internal 64-bit counter that ticks at the CPU clock frequency, which is used to tell how many cycles a subroutine takes (but alone it wouldn't be precise, because RDTSC is not a serializing instruction, unfortunately; I wish we were in ring 0, where we could use RDMSR, which is).
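Since RDTSC returns its 64-bit count split across EDX:EAX, the elapsed cycles have to be computed across two 32-bit halves. A minimal C sketch of that arithmetic follows, mirroring what a SUB/SBB pair does; the function names are hypothetical and the counter readings are passed in as plain values.

```c
#include <stdint.h>

/* Joins the EDX:EAX pair returned by RDTSC into one 64-bit value. */
uint64_t tsc_join(uint32_t eax, uint32_t edx)
{
    return ((uint64_t)edx << 32) | eax;
}

/* end - start computed on 32-bit halves, the way a SUB/SBB pair does it. */
uint64_t tsc_delta_halves(uint32_t start_lo, uint32_t start_hi,
                          uint32_t end_lo, uint32_t end_hi)
{
    uint32_t lo = end_lo - start_lo;          /* SUB: low halves */
    uint32_t borrow = end_lo < start_lo;      /* carry flag after the SUB */
    uint32_t hi = end_hi - start_hi - borrow; /* SBB: high halves minus borrow */
    return tsc_join(lo, hi);
}
```

The borrow matters whenever the low half wraps between the two readings, which on a fast CPU happens every few seconds.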

Anyway.. it's pretty basic stuff, but the implementation is careful (especially the next PROFILE).
Posted on 2002-08-19 13:52:49 by Maverick