Hi all,

Being interested in profiling I searched the board for it, and amongst others (like VTune and CodeAnalist) I came across Maverick's PROFILEr. He also released version 2.0 of it.

In the link below (dated 17/18 september 2002, so it's an old thread) there is a statement by Maverick that Profile v2.0 had some secondary issues to be fixed at that time.

http://www.asmcommunity.net/board/index.php?topic=7996.0

His statement was:
I still haven't had the time (and expecially the attitude) to fix some secondary issues of v2.. So people may prefer to use the older one.


I couldn't find the answers on the following questions:
- The secondary issues of v2: Have they been dealt with since?
- Can version 2 be used safely at the moment?
- If not, can one still use the older version? Are there any drawbacks?
- As I saw in the threads regarding the older version of PROFILE: There were problems translating the NASM version into Masm. What is the most reliable MASM translation at the moment? (I hope it is still being used)

Can anyone shape some light in the darkness?

Friendly regards,
mdevries.
Posted on 2007-03-01 08:57:05 by mdevries

Hi,


Hi all,

Being interested in profiling I searched the board for it, and amongst others (like VTune and CodeAnalist) I came across Maverick's PROFILEr. He also released version 2.0 of it.

In the link below (dated 17/18 september 2002, so it's an old thread) there is a statement by Maverick that Profile v2.0 had some secondary issues to be fixed at that time.

http://www.asmcommunity.net/board/index.php?topic=7996.0

His statement was:
I still haven't had the time (and expecially the attitude) to fix some secondary issues of v2.. So people may prefer to use the older one.


I couldn't find the answers on the following questions:
- The secondary issues of v2: Have they been dealt with since?


It's been a very long time. No, I don't think I worked more on it, sorry.


- Can version 2 be used safely at the moment?


If you test it on your system (CPUs that didn't exist at the time) and it appears to work correctly (expecially not giving negative results when you profile nothing or maybe a NOP) then I assume it's safe.


- If not, can one still use the older version? Are there any drawbacks?


Same as above I think.


- As I saw in the threads regarding the older version of PROFILE: There were problems translating the NASM version into Masm. What is the most reliable MASM translation at the moment? (I hope it is still being used)


Sorry, I've never been a MASM user, and I'm not a NASM user anymore (FASM rocks anytime!).


Can anyone shape some light in the darkness?


I am very sorry I cannot shed more light than that. It's been a long time and a lot of bits passed through my systems since then. I haven't really done much x86 asm since those times by the way.. I moved to other CPUs to get much more fun.


Friendly regards,
mdevries.


Greets,
Maverick

PS: please do not hate me :)

Posted on 2007-03-02 05:54:42 by Maverick
Thanks for stepping by and answering, anyway :)

Posted on 2007-03-02 09:26:17 by f0dder
Are any/all of the versions available ?

Multiple searches revealed posts without any attachments.

Such as:
http://www.asmcommunity.net/board/index.php?topic=7510
Posted on 2007-03-02 11:21:35 by dsouza123
Hi,

Thanks Maverick for your response.

If you test it on your system (CPUs that didn't exist at the time) and it appears to work correctly (expecially not giving negative results when you profile nothing or maybe a NOP) then I assume it's safe.


As you suggested I just tried both versions.

I profiled 2 types of functions with the older version of PROFILE.
- a function without parameters
- a function with 1 parameter
The first time the only thing the function contained was a RET.
The next time a NOP and a RET.
Then 2 NOP's and a RET,
Then 3 NOP's and a RET, etc.

After having done this I profiled the same functions with the newer version of PROFILE. These were the results on my Intel Pentium M notebook, running under Windows XP:


                      PROFILE              PROFILE v2.0
                0 Param  1 Param      0 Param  1 Param

0 nop ret        0        -1            0        0
1 nop ret        -4        2            0        -3
2 nop ret        -4        1            0        2
3 nop ret        -3        -2            0        1
4 nop ret        -5        4            1        -2
5 nop ret        0        0            0        -3
6 nop ret        -2        -2            0        2
7 nop ret        -8        0            2        4
8 nop ret        -4        -2            2        3
9 nop ret        -2        4            1        6
10 nop ret        -1        3            -3        6
11 nop ret        -1        2            -2        4
12 nop ret        -5        5            3        3
13 nop ret        -6        7            -1        3
14 nop ret        -1        7            -1        5
15 nop ret        -4        7            4        5
16 nop ret        1        6            4        7
17 nop ret        -4        5            5        8
18 nop ret        4        7            5        9
19 nop ret        2        6            5        8
20 nop ret        3        7            7        6
21 nop ret        3        10            2        8
22 nop ret        3        8            7        11
23 nop ret        4        7            8        12
24 nop ret        4        15            7        12


There was a stack unbalance after profiling the function with the parameter.
ESP before profiling: 0012FFC4h
ESP after profiling:  00130000h
So, there was a difference of 60 bytes, too much popped off the stack.

The stack unbalance doesn't appear after profiling the function without parameters.


This is the code I found and used for the older version of PROFILE. I hope this was the definite MASM version, that solved the alignment problems.


.686p

; ---------------------------------------------------------------------------

PROFILE                        MACRO          ROUTINE
                                MOV            ,ROUTINE
                                CALL            _PROFILE
ENDM

; ---------------------------------------------------------------------------
PROFILE_ALIGN_CODE MACRO
  PROFILE_ENDS_CODE MACRO
    _TEXT$1 ENDS
  ENDM
_TEXT$1 SEGMENT PAGE READONLY PUBLIC
ENDM
; ---------------------------------------------------------------------------
PROFILE_ALIGN_BSS MACRO
  PROFILE_ENDS_BSS MACRO
  _BSS$1 ENDS
  ENDM 
_BSS$1 SEGMENT PAGE PUBLIC
ENDM
; ---------------------------------------------------------------------------
PROFILE_ALIGN_DATA MACRO
  PROFILE_ENDS_DATA MACRO
  _DATA$1 ENDS
  ENDM
_DATA$1 SEGMENT PAGE PUBLIC
ENDM
; ---------------------------------------------------------------------------


; -------------------------------------

PROFILE_ALIGN_BSS  ; -------------------------------------------------------

                              ALIGN          64                              ; align to a cache entry on all CPU's
PROFILEROUTINE                DWORD            ?
PROFILECYCLES                  DWORD            ?
                              DWORD            ?
_PROFILEEMPTY                  DWORD            ?
                              DWORD            ?
_PROFILERETURN                DWORD            ?
_PROFILEINEAX                  DWORD            ?
_PROFILEINEBX                  DWORD            ?
_PROFILEINECX                  DWORD            ?
_PROFILEINEDX                  DWORD            ?
_PROFILEINESI                  DWORD            ?
_PROFILEINEDI                  DWORD            ?
_PROFILEINEBP                  DWORD            ?
_PROFILEINEFL                  DWORD            ?
_PROFILEOUTEAX                DWORD            ?
_PROFILEOUTEBX                DWORD            ?
_PROFILEOUTECX                DWORD            ?
_PROFILEOUTEDX                DWORD            ?
_PROFILEOUTEFL                DWORD            ?
_PROFILERETADDR                DWORD            ?

PROFILE_ENDS_BSS    ; -------------------------------------------------------

; -------------------------------------

PROFILE_ALIGN_CODE  ; -------------------------------------------------------

                                ALIGN          64 ;16                              ; align to a cache entry on all CPU's
_PROFILE:
                                MOV            [_PROFILEINEAX],EAX            ; saves INPUT EAX (will be trashed by CPUID)
                                MOV            [_PROFILEINEBX],EBX            ; saves INPUT EBX (will be trashed by CPUID)
                                MOV            [_PROFILEINECX],ECX            ; saves INPUT ECX (will be trashed by CPUID)
                                MOV            [_PROFILEINEDX],EDX            ; saves INPUT EDX (will be trashed by CPUID)
                                MOV            [_PROFILEINESI],ESI            ; saves INPUT ESI (will be trashed by the routine to be tested, which will be called multiple times)
                                MOV            [_PROFILEINEDI],EDI            ; saves INPUT EDE (will be trashed by the routine to be tested, which will be called multiple times)
                                MOV            [_PROFILEINEBP],EBP            ; saves INPUT EBP (will be trashed by the routine to be tested, which will be called multiple times)
                                PUSHFD
                                POP            [_PROFILEINEFL]                ; saves INPUT CPU EFLAGS
                                POP            [_PROFILERETURN]                ; saves return address
                                PUSH                            ; saves requested PROFILEROUTINE
                                MOV            ,@empty        ; first we'll profile a simple RET
                                MOV            [_PROFILERETADDR],@ret1
                                JMP            @profile                        ; make sure it gets cached
@ret1:                          MOV            [_PROFILERETADDR],@ret2
                                JMP            @profile                        ; profile for real (well, let it set up)
@ret2:                          MOV            [_PROFILERETADDR],@ret3
                                JMP            @profile                        ; profile for real (well, let it set up again)
@ret3:                          MOV            [_PROFILERETADDR],@ret4
                                JMP            @profile                        ; profile for real (well, let it set up one final time)
@ret4:                          MOV            [_PROFILERETADDR],@ret5
                                JMP            @profile                        ; profile for real
@ret5:                          MOV            EAX,
                                MOV            EDX,
                                MOV            [_PROFILEEMPTY+0],EAX          ; saves RET cycles count
                                MOV            [_PROFILEEMPTY+4],EDX
                                ;
                                POP                            ; restores requested PROFILEROUTINE
                                MOV            [_PROFILERETADDR],@ret6
                                JMP            @profile                        ; make sure it gets cached
@ret6:                          MOV            [_PROFILERETADDR],@ret7
                                JMP            @profile                        ; profile for real
@ret7:                          MOV            [_PROFILERETADDR],@ret8
                                JMP            @profile                        ; make sure it gets cached
@ret8:                          MOV            [_PROFILERETADDR],@ret9
                                JMP            @profile                        ; profile for real
@ret9:                          MOV            [_PROFILERETADDR],@ret10
                                JMP            @profile                        ; make sure it gets cached
@ret10:                        MOV            EAX,[_PROFILEEMPTY+0]          ; subtracts simple RET overhead
                                MOV            EDX,[_PROFILEEMPTY+4]
                                SUB            ,EAX          ; saves cycles, low 32bit
                                SBB            ,EDX          ; saves cycles, high 32bit
                                MOV            EAX,[_PROFILEOUTEAX]            ; gives OUTPUT EAX
                                MOV            EBX,[_PROFILEOUTEBX]            ; gives OUTPUT EBX
                                MOV            ECX,[_PROFILEOUTECX]            ; gives OUTPUT ECX
                                MOV            EDX,[_PROFILEOUTEDX]            ; gives OUTPUT EDX
                                PUSH            [_PROFILEOUTEFL]
                                POPFD                                          ; gives CPU EFLAGS
                                JMP            [_PROFILERETURN]                ; returns to caller
@profile:
                              ;WBINVD (if you want to test uncached data/code) ; beware, WBINVD is only for ring0
                                MOV            EAX,          ; touches caches
                                MOV            EDX,
                                MOV            EAX,[_PROFILEINEAX]
                                MOV            EBX,[_PROFILEINEBX]
                                MOV            ECX,[_PROFILEINECX]
                                MOV            EDX,[_PROFILEINEDX]
                                MOV            ESI,[_PROFILEINESI]
                                MOV            EDI,[_PROFILEINEDI]
                                MOV            EBP,[_PROFILEINEBP]
                                MOV            EAX,[_PROFILEINEFL]
                                MOV            EAX,[_PROFILEOUTEAX]
                                MOV            EBX,[_PROFILEOUTEBX]
                                MOV            ECX,[_PROFILEOUTECX]
                                MOV            EDX,[_PROFILEOUTEDX]
                                MOV            EAX,[_PROFILEOUTEFL]
                                MOV            EAX,
                                MOV            ECX,32
@@stack:                        PUSH            EAX                            ; touches stack
                                LOOP            @@stack
                                ADD            esp,128
                                XOR            EAX,EAX
                                CPUID                                          ; flush pipelines
                                RDTSC
                                MOV            ,EAX          ; saves TSC, low 32bit
                                MOV            ,EDX          ; saves TSC, high 32bit
                                XOR            EAX,EAX
                                CPUID                                          ; flush pipelines
                                MOV            EAX,[_PROFILEINEAX]            ; restores INPUT EAX
                                MOV            EBX,[_PROFILEINEBX]            ; restores INPUT EBX
                                MOV            ECX,[_PROFILEINECX]            ; restores INPUT ECX
                                MOV            EDX,[_PROFILEINEDX]            ; restores INPUT EDX
                                MOV            ESI,[_PROFILEINESI]
                                MOV            EDI,[_PROFILEINEDI]
                                MOV            EBP,[_PROFILEINEBP]
                                PUSH            [_PROFILEINEFL]
                                POPFD                                          ; restores CPU EFLAGS
                                CALL                            ; calls routine to be tested
                                MOV            [_PROFILEOUTEAX],EAX            ; saves OUTPUT EAX
                                MOV            [_PROFILEOUTEBX],EBX            ; saves OUTPUT EBX
                                MOV            [_PROFILEOUTECX],ECX            ; saves OUTPUT ECX
                                MOV            [_PROFILEOUTEDX],EDX            ; saves OUTPUT EDX
                                PUSHFD
                                POP            [_PROFILEOUTEFL]                ; saves OUTPUT CPU EFLAGS
                                XOR            EAX,EAX
                                CPUID                                          ; flush pipelines
                                RDTSC
                                XCHG            ,EAX
                                XCHG            ,EDX
                                SUB            ,EAX          ; saves TSC, low 32bit
                                SBB            ,EDX          ; saves TSC, high 32bit
                                JMP            [_PROFILERETADDR]

                                ALIGN          64 ; 16                              ; align to a cache entry on all CPU's
@empty:                        RET

PROFILE_ENDS_CODE  ; -------------------------------------------------------



This is my little test function:

Note that PROFILE_ALIGN_CODE and PROFILE_ENDS_CODE are NaN's macros to start and end a page aligned new code segment.
I used the test function for both the older and the newer version of PROFILE.


PROFILE_ALIGN_CODE
align 64

testfunc proc ;Param1:dword ;Uncomment Param1 to have 1 parameter
        nop ;Add more nops here, and restart profiling.
        ret
testfunc endp

PROFILE_ENDS_CODE


This is the way I called the test function, using the older version of PROFILE:
I am using BiteRider's ObjAsm32's debug system (DebugCenter) to show the results.


mov eax,esp
DbgHex eax,"esp before profiling"

mov eax,offset _PROFILE
DbgHex eax,"Profile base address"

;push eax ;uncomment this to have 1 parameter
PROFILE testfunc

DbgHex qword ptr , "Cycles after profiling"

mov eax,esp
DbgHex eax,"esp after profiling"



Now the code for PROFILE v2.0, partly borrowed from MArtial_Code (January 18, 2003):

This is an include file that includes the binary code from the Profile.bin file.
The binary code was supplied by Maverick to make PROFILE v2.0 assembler independent:


PROFILERSIZE equ 660
PROFILERINITOFFSET equ 559
PROFILERRESULTOFFSET equ 4096
PROFILERSTARTOFFSET equ 1

_PROF1 segment dword 'PROFILER' ;separate section
;Profiler proc

Profiler:
db 195,137,45,56,0,0,0,189,0,0,0,0,156,143,69,60
db 137,125,52,137,117,48,137,85,44,137,77,40,137,93,36,137
db 69,32,143,69,16,139,69,8,11,69,12,117,72,199,5,193
db 1,0,0,0,0,0,0,129,45,193,1,0,0,197,1,0
db 0,199,69,92,0,0,0,0,199,69,88,0,0,0,0,199
db 69,20,91,0,0,0,233,246,0,0,0,232,168,0,0,0
db 255,69,88,131,125,88,16,114,230,139,69,0,139,85,4,137
db 69,8,137,85,12,88,163,193,1,0,0,129,45,193,1,0
db 0,197,1,0,0,199,69,92,0,0,0,0,199,69,88,0
db 0,0,0,199,69,20,159,0,0,0,233,178,0,0,0,232
db 100,0,0,0,255,69,88,131,125,88,16,114,230,139,133,0
db 1,0,0,49,219,185,1,0,0,0,57,132,141,0,1,0
db 0,118,9,139,132,141,0,1,0,0,137,203,65,59,77,92
db 114,232,139,132,221,128,0,0,0,139,148,221,132,0,0,0
db 43,69,8,27,85,12,137,69,0,137,85,4,139,69,64,139
db 93,68,139,77,72,139,85,76,139,109,80,255,53,84,0,0
db 0,157,255,37,16,0,0,0,139,69,0,139,85,4,185,0
db 0,0,0,57,132,205,128,0,0,0,117,18,57,148,205,132
db 0,0,0,117,9,255,132,141,0,1,0,0,235,34,65,59
db 77,92,114,223,137,132,205,128,0,0,0,137,148,205,132,0
db 0,0,199,132,141,0,1,0,0,1,0,0,0,255,69,92
db 195,139,125,52,139,117,48,139,69,0,139,69,32,139,69,64
db 185,4,0,0,0,131,236,32,139,4,36,73,117,247,129,196
db 128,0,0,0,255,117,60,157,155,235,5,144,144,144,144,144
db 15,162,15,49,137,69,0,137,85,4,139,85,44,139,77,40
db 139,93,36,139,69,32,139,109,56,199,68,36,252,192,1,0
db 0,255,100,36,252,144,144,144,144,144,144,144,144,144,144,144
db 144,144,144,144,144,144,144,144,144,144,144,144,144,144,144,144
db 232,0,0,0,0,199,68,36,252,0,2,0,0,255,100,36
db 252,144,144,144,144,144,144,144,144,144,144,144,144,144,144,144
db 144,144,144,144,144,144,144,144,144,144,144,144,144,144,144,144
db 144,144,144,144,144,144,144,144,144,144,144,144,144,144,144,144
db 137,45,80,0,0,0,189,0,0,0,0,137,85,76,137,77
db 72,137,93,68,137,69,64,156,143,69,84,155,15,162,15,49
db 43,69,0,27,85,4,137,69,0,137,85,4,255,101,20,86
db 87,232,90,0,0,0,129,231,0,240,255,255,141,183,0,16
db 0,0,1,127,47,1,127,51,1,127,57,1,127,61,1,127
db 82,1,127,119,1,127,125,1,191,129,0,0,0,1,191,150
db 0,0,0,1,191,157,1,0,0,1,191,201,1,0,0,1
db 119,3,1,119,8,1,183,253,0,0,0,1,183,4,1,0
db 0,1,183,2,2,0,0,1,183,7,2,0,0,95,94,195
db 139,60,36,195
db 4096*2-PROFILERSIZE dup (0) ;Combined size of section is 8192 bytes

_PROF1 ends


This is the way I called my test function, using PROFILE v2.0:


mov eax,esp
DbgHex eax,"esp before profiling"

invoke VirtualAlloc,0,8192,MEM_COMMIT,PAGE_EXECUTE_READWRITE
mov MyAlloc,eax
DbgHex eax,"Alloc base address"

;copy Maverick's binary code to the allocated memory
cld
mov esi,offset Profiler
mov edi,MyAlloc
mov ecx,8192/4
rep movsd

mov esi,MyAlloc
add esi,PROFILERINITOFFSET
call esi

mov esi,MyAlloc
add esi,PROFILERSTARTOFFSET

;push eax ;uncomment this to have 1 parameter
push offset testfunc
call esi

mov esi,MyAlloc
DbgHex qword ptr

mov eax,esp
DbgHex eax,"esp after profiling"

invoke VirtualFree,MyAlloc,0,MEM_RELEASE



If I did everything right, my conclusion is:
Both the older and the newer versions of PROFILE don't produce the correct amount of cycles anymore, on nowadays processors.

But I hope I did something wrong, for that would mean, my conclusion is wrong, and that could mean: PROFILE is still alive!!

If some of you are still using PROFILE succesfully on the newer processors, or if you can find something wrong in the code, please respond. For I would like PROFILE to be doing it's job again.

Friendly regards,
mdevries.
Posted on 2007-03-03 16:30:58 by mdevries
Unfortunately on these newer CPUs there are additional factors that should be counted in. I stopped myself at branch cache (simulating a flush) but, as you may imagine, on a Pentium M there's also trace cache issues and "God" only knows how many other little, subtle stuff to "sync". Sorry, I can't help because I don't have any Pentium M (let away the time to do it, honestly), but you or others may do it, all you need to know is that the processor must be "reset" in a clean BUT I/D-cached (to avoid memory stalls polluting the results) state before executing the code under test. Every CPU has its one cache line sizes, pipeline quircks, architecture, etc.. so at this point I think it's better to write one PROFILE routine at least per CPU family..

Posted on 2007-03-05 02:45:26 by Maverick
Hi Maverick,.

Thanks for your reply.

You did a wonderful job with PROFILE 5 years ago, although it can't be used on modern cpu's anymore, since many things in the cpu have changed since.
I think PROFILE must be considered a great achievement anyway.

Friendly regards,
mdevries.
Posted on 2007-03-06 12:54:56 by mdevries