I'd be of more help on this board if MASM weren't the standard here.
To be clear, I don't want to criticize others' choices, all the more since almost everybody here uses MASM, but personally I find it very un-asm-like to use INVOKE, .IF, etc. I have my own High Level Assembler, and I think pure assembly should never contain such MASM-specific constructs.
After all, this is the win32asm board, not the win32masm board. What holds me back most from supporting the board is that every larger piece of assembly code I write would break on MASM, which most people here use.

Personally I find NASM much better, because of local labels, forward referencing and many other minor reasons. Also, if powerful macro support is what keeps you all tied to MASM, NASM has a very powerful macro processor as well. As for Win32 and DirectX API support, what can I say: I use all of them without trouble from NASM, so I don't see how MASM would be superior in this regard either.

Even worse, these days I increasingly use my own assembler (built into my language's compiler), so I find it hard to release even NASM code publicly.

Anyway, when I do write some NASM code (instead of code for my own language and compiler), I think it's better to release it as NASM code than not to release it at all just because everybody else uses MASM. I'd like to contribute to the board when possible.

For a start (I'll see from the reactions whether it's worth converting other stuff to NASM), here is a routine I use to profile code. I've already posted it in Watcom C++'s inline asm syntax, but that's a very limited assembler, so I have now converted it to NASM syntax and am posting it again.
If anybody is interested in converting my NASM code to MASM, they can do the conversion and post it here publicly.
Honestly, I can't do the MASM conversion myself, because I don't have the time and because personally I don't like MASM.

PROFILE:

Usage is very simple: set up your CPU registers as usual, then instead of "CALL yoursubroutine" use "PROFILE yoursubroutine" (PROFILE is a macro), and the 64-bit result variable will hold exactly how many cycles your routine took to execute (if it took fewer than 2^32 cycles, you can simply read it as a 32-bit variable). Note that the routine adapts itself to any past, present and future CPU (so you don't have to subtract x cycles depending on your CPU), and it automatically removes the cost of the final RET in the subroutine under test. It calls the subroutine several times to make sure the caches are warmed up (you can also test uncached, but only in ring 0, e.g. in protected-mode DOS, where I used this PROFILE routine most), and to do this correctly it saves and restores the CPU registers around each call. So the only limitation is that, because the routine under test is called more than once, external pointers or counters must be initialized inside your own routine. As far as CPU registers are concerned it's transparent, and behaves as if your subroutine were called only once.
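
The measurement idea described above (warm the caches, time the routine, time a bare RET the same way, subtract) can be sketched in C. This is not the PROFILE source: the names profile_cycles, counter_fn and routine_fn are made up for illustration, register saving/restoring is left out, and a simulated tick counter (sim_now) stands in for the CPU's cycle counter so that the sketch behaves the same everywhere. On real hardware one would pass a function that reads the time-stamp counter instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical types for this sketch; they are not part of PROFILE. */
typedef uint64_t (*counter_fn)(void);   /* reads some cycle counter */
typedef void (*routine_fn)(void);       /* routine under test       */

/* Time a single call: counter delta around f(). The delta still
   contains the fixed cost of reading the counter. */
static uint64_t time_once(counter_fn now, routine_fn f)
{
    uint64_t start = now();
    f();
    return now() - start;
}

/* Run both routines `warmups` times first, then return
   cost(f) - cost(empty), clamped at 0. Subtracting the measurement of
   an empty routine removes the counter overhead and the cost of the
   final return, whatever they happen to be on the machine at hand. */
uint64_t profile_cycles(counter_fn now, routine_fn f,
                        routine_fn empty, int warmups)
{
    assert(now && f && empty);
    for (int i = 0; i < warmups; i++) {
        (void)time_once(now, empty);
        (void)time_once(now, f);
    }
    uint64_t baseline = time_once(now, empty);
    uint64_t raw = time_once(now, f);
    return raw > baseline ? raw - baseline : 0;
}

/* Simulated clock (an assumption of this sketch, for determinism). */
static uint64_t sim_ticks;
uint64_t sim_now(void)   { sim_ticks += 3; return sim_ticks; } /* a read costs 3 ticks */
void     sim_empty(void) { sim_ticks += 1; }                   /* "just a RET": 1 tick  */
void     sim_work(void)  { sim_ticks += 6; }                   /* 5 ticks of work + RET */
```

With the simulated clock, profile_cycles(sim_now, sim_work, sim_empty, 4) yields 5 and profiling sim_empty against itself yields 0. The 64-bit result can be split into low and high dwords with (uint32_t)r and (uint32_t)(r >> 32).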

The advantages of this routine? It's 100% precise, consistent and stable; that was my first goal, and I've never seen one that behaved better (that's why I wrote it). It adapts itself to any CPU, saving you from that major hassle. It can profile whole subroutines just by using PROFILE instead of CALL, and since it gives a 64-bit result it can profile long-running routines as well. I still use it a lot to compare different versions of a subroutine on the same dataset and choose the best one. Also, it's precise down to 0 cycles, which is the most interesting thing IMO, since it's reliable and can be used to test any routine in a perfectly precise way.

Have a look at the source for better insight if this "doc" didn't answer all of your questions. Here's the source anyway:


PS: I may release an "inlined test" profiler later today, which has its own advantages in some specific cases (you want to test the routine only once, cached or not, and/or you don't want to call a subroutine), but of course it won't produce results as consistent and reliable as the routine I just released.
Also, for the old Pentium I had (and hopefully still have somewhere on my HD) a version of PROFILE which, using the MSRs, reported not only the CPU cycles but also stalls, U/V pipe execution counts, etc. Too bad this stuff is CPU-model specific.
Posted on 2002-03-25 06:15:07 by Maverick
Maverick, I moved to MASM from NASM when the development of that assembler became stagnant. Although it seems to have been kept alive, I believe it has been surpassed by better assemblers: SpASM and FASM. This comment is in regard to producing Win32 programs (yes, NASM has the niche between Linux and Windows ASM).
Posted on 2002-03-25 10:31:25 by bitRAKE
bitRAKE, translate Maverick's code to MASM, please.
So more people can use it.
Posted on 2002-03-25 10:36:05 by The Svin
maverick,

i also prefer nasm, or even tasm, over masm for non-gui things. keep posting nasm sources; masm/tasm users can use them easily by assembling the snippet to a raw binary and then using a bin2inc tool, or the tool f0dder posted on this board that converts bin->coff, to link it into their code

ancev
Posted on 2002-03-25 13:54:24 by ancev
heck, you can just assemble with nasm and link the .obj with masm
or tasm or <whatever>.
Posted on 2002-03-25 14:37:08 by f0dder
I hate keeping programs I don't use.
Post an example, then; I'll disassemble it and take the opcodes.
I hate new syntax, new compilers and anything "new" which is actually not new but just different names for old things.

I've often heard HLL programmers misunderstand D. Knuth's words. They say that Knuth "found it useful to be able to learn a new
language each week". The truth is Donald never said that nonsense; he was talking about a "new machine language", in other words
learning a new machine, new opcodes, etc. Feel the difference.
Posted on 2002-03-25 15:19:07 by The Svin
ancev: I will, thank you. Also, if I find 15 minutes or so I will write a post about that whole bounds checking thing, extending the new post/contribution started by The Svin. I think it's an interesting subject.

f0dder: right, but when one uses macros (I admire those who write macros as good as e.g. bitRAKE's, but I prefer to keep their use to a minimum, again because for me asm should be asm and HLA should be HLA), those macros won't go into the .OBJ files.
Although I don't use the real power of macros, I like to use them extensively, e.g. to rename things. I had to search/replace e.g. 'DWORD' in the source I provided (I don't like that word, I prefer the much more logical 'U32').
Posted on 2002-03-25 17:23:22 by Maverick
True, if you use macros to "call" the code to be timed, some source
level changes will have to be done.

And yes, "u32" is nicer than "dword", I use u/s{8,16,32} in my C
source, as well as u/sint and similar. Shorter to write, etc.
Posted on 2002-03-25 17:27:23 by f0dder
Same here, but I use all uppercase (system code/data and system types), mixed case (application code/data and types), and all lowercase (local data).
Posted on 2002-03-25 19:33:17 by Maverick
Thanks for the macro Maverick. Gives very accurate results here.
I've translated it to MASM syntax if anyone wants it.
Posted on 2002-03-26 00:00:15 by grv575
Thanks, grv575.

Maverick, it's the best I've seen.
Posted on 2002-03-26 11:53:15 by The Svin
Thanks Alex, I'm honoured by your comment. :)
Posted on 2002-03-26 12:11:40 by Maverick
First, thanks: this is quite useful and I'm already using it heavily.

The results are rock solid (precision) but on my win98/PII I'm getting tick counts that are one lower than expected. I know that pairing is good but 0 cycles is better than expected ;)

Seriously, no dis intended & this isn't a real problem but I was wondering if anyone else has seen this?

Maybe I've mutated the routine & slimed myself but all I'm doing is running the same test that grv575 did. Well I'm actually doing more with it but in another proggie.
Posted on 2002-03-31 19:39:08 by Mutant Slime
Yes, on the surface, i have to say i'm also impressed by the ease of this handy tool. Good job Maverick!

However, i have noticed a range of #'s develop with the random # algos being discussed in another recent thread. The profiler agrees, to a ballpark range, and i discovered (to no surprise) that the order of instructions preceding the profiler affects the outcome. Adding nops and other things before the PROFILE macro does affect things, even if you keep the profile macros together as he has done.

To stabilize this, I've found that adding "ALIGN 4" before all critical areas eliminates this variance. Here is the code (to give you the idea):
.586

.model flat,stdcall

include profile.inc

include \masm32\include\windows.inc
include \masm32\include\kernel32.inc
include \masm32\include\masm32.inc
include \masm32\include\debug.inc
includelib \masm32\lib\masm32.lib
includelib \masm32\lib\debug.lib

nrandom PROTO :DWORD
mrandom PROTO :DWORD
nseed PROTO :DWORD

.code

start:
invoke nseed, 1234565
nop
nop
align 4
PROFILE simple_test
PrintDword PROFILECYCLES
PrintDword PROFILECYCLES+4
invoke nseed, 1234565
nop
align 4
PROFILE simple_test2
PrintDword PROFILECYCLES
PrintDword PROFILECYCLES+4

invoke ExitProcess,0


align 4
simple_test proc
invoke nrandom, 10
ret
simple_test endp


align 4
simple_test2 proc
invoke mrandom, 10
ret
simple_test2 endp

;#########################################################################
;
; Park Miller random number algorithm.
;
; Written by Jaymeson Trudgen (NaN)
; Optimized by bitRAKE (Rickey Bowers Jr.)
;
; Size: 55 Bytes, CPU Time: 98 Ticks
;#########################################################################

align 4
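nrandom PROC base:DWORD

mov eax, nrandom_seed    ; eax = current seed

xor edx, edx             ; clear the high dword of the dividend
mov ecx, 127773
div ecx                  ; eax = q = seed / 127773, edx = r = seed mod 127773
mov ecx, eax             ; ecx = q
mov eax, 16807
mul edx                  ; eax = 16807 * r
mov edx, ecx             ; edx = q
mov ecx, eax             ; ecx = 16807 * r
mov eax, 2836
mul edx                  ; eax = 2836 * q
sub ecx, eax             ; ecx = 16807*r - 2836*q = new seed
xor edx, edx
mov eax, ecx
mov nrandom_seed, ecx    ; store the new seed
div base                 ; edx = new seed mod base

mov eax, edx             ; return the remainder in eax
ret

nrandom ENDP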

; #########################################################################

align 4
nseed proc TheSeed:DWORD

.data
nrandom_seed dd 12345678
.code

mov eax, TheSeed
mov nrandom_seed, eax

ret

nseed endp

; #########################################################################
align 4
mrandom PROC base:DWORD

mov eax, nrandom_seed

xor edx,edx
push 127773
div DWORD PTR [esp]
push eax
mov eax, 16807
mul edx
pop edx
push eax
mov eax, 2836
mul edx
pop edx
sub edx, eax
mov eax, edx
mov nrandom_seed, edx
push base
mov edx, 0
div DWORD PTR [esp]
add esp,8
mov eax, edx

ret
mrandom ENDP

end start


This kept the comparisons the same on every try and *mix* of nops etc.

With this in place the evaluations are consistently 95 : 102 between the two versions of the random # gens (which still agrees with my brute force method described in the other thread).
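
For readers who don't want to trace the registers, here is a rough C rendering of what the nrandom listing above computes. It is a sketch with made-up names (pm_set_seed, pm_random, pm_get_seed), not part of the posted code. It mirrors the listing literally: the subtraction wraps modulo 2^32 like the `sub ecx, eax` does, and there is no correction step for a negative difference, which textbook versions of this generator usually add.

```c
#include <assert.h>
#include <stdint.h>

/* Same default value as nrandom_seed in the listing. */
static uint32_t pm_seed = 12345678u;

/* Counterpart of nseed. */
void pm_set_seed(uint32_t s) { pm_seed = s; }

/* Lets a caller inspect the stored seed (hypothetical helper). */
uint32_t pm_get_seed(void) { return pm_seed; }

/* Counterpart of nrandom: advance the seed and return seed mod base. */
uint32_t pm_random(uint32_t base)
{
    assert(base != 0);                 /* the asm would fault on div by 0 */
    uint32_t q = pm_seed / 127773u;    /* div ecx: quotient in eax        */
    uint32_t r = pm_seed % 127773u;    /*          remainder in edx       */
    /* 16807*r always fits in 32 bits because r < 127773; the
       subtraction wraps modulo 2^32 exactly like `sub ecx, eax`. */
    pm_seed = 16807u * r - 2836u * q;
    return pm_seed % base;
}
```

Starting from a seed of 1, the stored seed goes 16807, 282475249, 1622650073, so pm_random(10) returns 7, 9, 3.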

Anyways, thanx again for this handy tool Maverick! I think this would make a nice addition to the MASM package :grin: (no offense intended to your assembler preference).

:alright:
NaN
Posted on 2002-03-31 22:56:05 by NaN

Mutant Slime: that is probably because of the (documented) fact that the cost of the RET is automatically (and intentionally) removed by the routine. So profiling a routine that is just a RET results in 0 cycles; one or two NOPs plus a RET result in 1 cycle, etc.

As Microsoft would say, "that is a feature, not a bug".. but this time it's true. ;)
Posted on 2002-04-01 00:33:29 by Maverick

NaN: when doing the NASM->MASM translation, if I recall correctly grv575 said he couldn't follow the alignment rules. That will definitely be a problem, as reported.
Another possible (but improbable) problem is the fact that MASM defaults to the short form of instructions, but as I said that is unlikely to cause problems.

Could you perform the same tests using NASM? Just to see what the problem is (I suspect the alignment, which wasn't 100% accurate in the MASM translation).
Posted on 2002-04-01 00:37:25 by Maverick
PS: I tested my routine on a Pentium II / Windows 98 and it works perfectly. It must be something specific to MASM or to the MASM version.
Posted on 2002-04-01 01:26:54 by Maverick
Well i dont know about NASN, but from what i see, there is no way the macro can *know* where the routine is, to be aligned better when called upon.

This is why i added "ALIGN 4" before the procs that would be called.
align 4
simple_test proc
invoke nrandom, 10
ret
simple_test endp


As well, since the PROFILE macro doesn't parse multiple arguments (params of the routine being profiled), i added another align to the actual routine so that the invoke would be "smoother" between procs.
align 4
nrandom PROC base:DWORD
...
nrandom ENDP


When this was done, MASM would align the address of each proc on a 4-byte boundary, which would keep the loading of the addresses into the pipelines consistent (i think).

When they weren't, any extra bytes added anywhere in the code preceding the procs would eventually trickle down and affect the starting address of each proc, and thus cause more overhead when off alignment.


The only way i see NASM doing this is by automatically aligning every proc behind the scenes ??? (but im only guessing).

Anywho, hope this sheds more light..
:alright:
NaN
Posted on 2002-04-01 03:52:29 by NaN
ah.. yeah, I got it wrong then last time ;)
I thought you meant that the PROFILE macro had to be aligned to give proper results (i.e. CALL PROFILE had to be aligned), which didn't make much sense to me at all, to be honest.

The reason the routines you test should be aligned is simple: I profile a simple RET as a reference, and that simple RET *is* aligned. The whole purpose of my PROFILE code is to reliably compare different routines under the same identical conditions, so you must provide the same identical conditions for all the routines you have to test, including the built-in PROFILE RET test.

It makes a lot of sense then that they all must begin with the same alignment.

This alignment is not 4, this alignment must be 64 (a whole cache line). I hope you can do that with MASM.. because it's the way I specified it in my original Na(S)N code ;)
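
One way to see why the start address matters: count how many cache lines a routine's bytes fall into for a given start offset. The C helper below is only an illustration (cache_lines_touched is a made-up name, and the line size is a parameter; 64 is the figure from this post). It is one plausible contributor to the variance reported earlier, not a confirmed diagnosis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines of `line` bytes that are touched by a block of
   `size` bytes starting at address/offset `start`. */
size_t cache_lines_touched(uintptr_t start, size_t size, size_t line)
{
    assert(line != 0);
    if (size == 0)
        return 0;
    uintptr_t first = start / line;               /* line holding the first byte */
    uintptr_t last  = (start + size - 1) / line;  /* line holding the last byte  */
    return (size_t)(last - first + 1);
}
```

Taking the 55-byte size quoted in the nrandom header comment: starting at offset 0 or 64 the routine sits in one 64-byte line, while starting at offset 12 it straddles two, so a couple of extra bytes earlier in the code can change what the routine costs to fetch.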

I hope that clears all the doubts, sorry if I misunderstood the previous post.
Posted on 2002-04-01 04:12:06 by Maverick
No probs here :)


Hey bitRAKE! Here's a stumper for you: I racked my brain on this for a couple of hours and got nowhere fast!

Write a custom ALIGN 64 or ALIGN 32 macro..

I thought it would be simple to do, at first. Then i ran into MASM's irritable way of doing things.

My method was:


MyAlign64 MACRO
BB equ 64 - ($ MOD 64)
repeat BB
nop
endm
endm


But you see, this would be too simple.... I found out that $ is termed as 'imedExpr' where repeat wants a 'constExpr'.

The fix in C terms would be casting the type of expression to constant, " BB equ (constExpr)( 64 - ($ MOD 64))"

But this is where i stale-mated and decided to see if you have any wisdom to shed on this idea....
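
Separately from the constExpr question, the padding arithmetic itself has an edge case worth checking. The C sketch below (function names are made up, and it says nothing about how to get MASM to accept the expression) compares the formula from the macro with a variant that takes one more modulo, so an offset that is already aligned gets no padding instead of a full 64 bytes.

```c
#include <assert.h>
#include <stdint.h>

/* The expression used in the macro above: boundary - (offset MOD boundary).
   For an already aligned offset this yields `boundary`, not 0. */
uint32_t pad_naive(uint32_t offset, uint32_t boundary)
{
    assert(boundary != 0);
    return boundary - (offset % boundary);
}

/* Same idea with a final modulo, so aligned offsets need 0 bytes. */
uint32_t pad_to_boundary(uint32_t offset, uint32_t boundary)
{
    assert(boundary != 0);
    return (boundary - offset % boundary) % boundary;
}
```

For example, pad_naive(128, 64) is 64 while pad_to_boundary(128, 64) is 0; for an unaligned offset such as 65 both give 63.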

Good luck!
:alright:
NaN
Posted on 2002-04-01 06:45:48 by NaN