Hi Maverick,

Okay, I wrote "Your new version of the PROFILER is just great!" because I knew it would be once you release it ;)

Actually I meant the FASM conversion. I had only seen the MASM translation by (can't remember his name, something with 575 I think), and this worked better.

So I have been fixing my sloppy version of your exact PROFILER, and now it works:

1) It now imports TerminateProcess;

2) It's a single .flat section, so that it can write, read and execute code;

3) Fixed a bug in mov eax,startupinfo.size;

4) It handles DWtoString properly.

But I can't make the ALIGN macro work.
I have tried:
1) Privalov's macro;

2) Maverick's extended (with offset) macro;

3) A few I made up (you can probably see an example in the sources; I like the 'tonto' version a lot ;)

but I don't know what goes wrong.
The program works:
1) It lets you select a routine to profile;
2) It runs it and shows a message box with the number of cycles it took.

The results are not consistent, but that is because we still have to subtract the epilogue.

With some programs it gives problems, dunno why. For example, with Tomasz's template for Windows it hangs after giving the cycles.

So, if you use it, do so at your own risk.
You know "sloppy inside" ;)

If you dare, download it here. There are two versions, built in different ways, but I think they produce the same executable.

(Btw, what an incredible mixture: Maverick's precise perfectionism & my ******* <---thanks for editing code.)
Hope you don't mind, Fabio. Still amigos? Yours is gonna be so good that when I see it I'll stop playing ;)
Posted on 2002-08-20 12:23:03 by slop
Hi sloppy,
From a look at your source, it looks like you're profiling a subroutine (at label Address) which contains calls to CreateProcess, and others.

Since PROFILE calls the routine to be profiled multiple times, you're gonna create many processes. Moreover, the execution time of Win32 API functions is *very* inconsistent, while PROFILE requires the function under test to behave exactly the same way (and take the same amount of time) on every call. The repeated calls are necessary because the first time PROFILE calls the function, it will likely not even be in the L1/L2 cache; so this is not an arbitrary requirement of PROFILE, but rather something the user will appreciate.

I suggest you load the routine to be tested just once and set everything up (CreateProcess, etc.) beforehand, then PROFILE the routine (remember it will be called several times and has to behave exactly the same way each time), and then do the cleanup outside of the routine being tested.
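To make the "called several times" point concrete, here is a minimal sketch in C (not FASM, so it runs anywhere) of a harness that calls a routine repeatedly and keeps the lowest reading. It is not PROFILE's actual code: `profile_min`, `sim_now` and `sim_routine` are hypothetical names, the counter is simulated rather than read from the TSC, and taking the minimum over the passes is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);   /* hypothetical cycle-counter reader */
typedef void (*routine_fn)(void *ctx);  /* routine under test */

/* Call the routine `passes` times and return the smallest elapsed count.
   The routine must do the same work on every call for this to mean anything. */
static uint64_t profile_min(routine_fn routine, void *ctx, int passes, counter_fn now) {
    assert(passes > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < passes; i++) {
        uint64_t start = now();
        routine(ctx);
        uint64_t elapsed = now() - start;
        if (elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}

/* Simulated counter (assumption): a real build would read the TSC instead. */
static uint64_t sim_cycles = 0;
static uint64_t sim_now(void) { return sim_cycles; }

/* Simulated routine: the first ("cold") call costs more than the later ones. */
static void sim_routine(void *ctx) {
    int *calls = (int *)ctx;
    sim_cycles += (*calls == 0) ? 500 : 120;
    (*calls)++;
}
```

With a call counter `int calls = 0;`, `profile_min(sim_routine, &calls, 5, sim_now)` runs the routine five times. If the routine created a process on each call, five processes would be created, which is why setup and cleanup belong outside it.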
Posted on 2002-08-21 05:48:36 by Maverick
Hi Maverick,

Incredible as it may seem, it works.

Have a look at the attached files (run them in Bochs or something if you don't trust them) and see for yourself: the 5 passes are taken internally. I don't know exactly how, but it works; it's one of those miracles of the Win32 API ;)

And I'm eager to see yours, but I know this sort of highly specific programming takes time.
Will you also comment it as well as the macro version?


Take care,

(If you remember, I told you :alright: I prefer the messageboard :rolleyes: instead of e-mails :eek: because of the :confused: cool smileys. So here...:grin: )
Posted on 2002-08-21 11:31:04 by slop
Hi sloppy, you wrote:

> Incredible as it may seem, it works. [..] I don't know exactly how, but it works; it's one of those miracles of the Win32 API ;)

The concept of "it works" is a very vague one. It may seem to work OK, yet not be as precise and consistent as it was designed to be, or it may give problems only on one class of CPU. Because of all this, I will give precise instructions for the next PROFILE, and will not support anything that doesn't follow them 100.0% strictly. This is because I know of the many subtle (and nearly invisible) malfunctions that may otherwise arise, even though they aren't macroscopic or too evident.

> (If you remember, I told you :alright: I prefer the messageboard :rolleyes: instead of e-mails :eek: because of the :confused: cool smileys. So here...:grin: )

Hmm.. perhaps you sent me an email about the programming languages and stuff.. but I haven't replied yet. Sorry.. too many things to do, it was lost in the mess of my mailbox.

<edit>

PS: I'll reply to the last parts of that email discussion here:

> Then Forth is great... if I had known it before, I would never have studied
> C. I'm quitting C. Now it's Forth for me.

That is a bit of a gamble.. I did suggest that you study Forth, true, but I also said that you probably won't end up using it. I said to study it because it opens your mind a lot anyway, and gives a good, intuitive view of stacks and recursion.
I wouldn't quit C, because it's a very widely used/known language, and thus it's always useful to know/master it. Also, it's a decent language considering the alternatives.


> Do you think there are many assembler programmers? Or just like Forth?

I think both are very rare.. but that doesn't surprise me. Here you find a lot of quality asm programmers anyway.

</edit>
Posted on 2002-08-22 04:22:41 by Maverick
Yes, Maverick,
again you're right. You see: you are more like a scientist, while I'm more like a... well, I was going to say an artist, but that's Privalov, he truly is one...

About Forth, it was a great thing that you recommended it to me. I'm sort of grasping it... it's EVERYTHING, I mean, there are no limits like in other languages I've tried before. Okay, C is still the standard; I know.

And sure the best ASM programmers are here, we see them in every thread ;)

Listen, I like it when we talk about our personal stuff here, but let's sort of 'mix' it with general things, or else maybe an administrator is going to say something ;)

So here's the general thing: when do you think your new PROFILE will be ready?

Have fun,
sloppy.
Posted on 2002-08-22 12:53:14 by slop
PROFILE v2.0 is ready.

Only I don't have it on this PC, so I will post it tomorrow morning (before 12:00 CET).

Thank you all for your patience.
Posted on 2002-08-22 15:11:50 by Maverick
I've posted PROFILE v2.0 here:
http://www.asmcommunity.net/board/index.php?topic=7510

To Nexo and bitRAKE:
On second thought, maybe that weird stall is due to odd vs. even alignment issues with Jcc on the Athlon.

To The Svin:
I'm sorry your request couldn't be fulfilled. There are many reasons why an inlined "PRE-PROFILE" / code to be tested / "POST-PROFILE" solution is not reliable; to give just one example, consider the code cache issues alone. Even if we allow the inlined code under test to be aligned to a cache line (let's take the Pentium cache line size as an example: 32 bytes), a piece of code 33 bytes long will take *a lot* longer to execute than a 32-byte one, giving misleading results. They are misleading because in the real world, where the code is not guaranteed to be aligned at all, the 33-byte version may not be any slower than the 32-byte one (it may even be faster). Even worse, on out-of-order CPUs (e.g. P-Pro, K7, etc.) things depend on even more internal factors, which on the first run are very unpredictable. Branch prediction makes the results even less consistent. Not to mention that our 33-byte routine may or may not already be cached, entirely or in part, causing one or two cache line loads from memory (which take a non-constant amount of time) and making the result highly unpredictable. The ~randomness associated with such a profiling technique is too high to make it really useful.

I think that the only reliable way to profile a piece of code, mainly for comparative purposes (i.e. to find/tune the best optimization of a certain routine for a certain CPU), is to profile a subroutine, as PROFILE does.
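The 32 vs. 33 byte argument can be checked with a little arithmetic. The following is a small C sketch (the helper name `cache_lines_touched` is made up for illustration) that counts how many cache lines a block of code spans, given its start address, its length and the line size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of cache lines spanned by `length` bytes starting at `start`.
   `line_size` is the cache line size in bytes (32 on the Pentium). */
static size_t cache_lines_touched(uintptr_t start, size_t length, size_t line_size) {
    assert(line_size > 0);
    if (length == 0) {
        return 0;
    }
    uintptr_t first = start / line_size;                /* line holding the first byte */
    uintptr_t last = (start + length - 1) / line_size;  /* line holding the last byte */
    return (size_t)(last - first + 1);
}
```

Aligned at address 0 with 32-byte lines, a 32-byte routine spans one line and a 33-byte routine spans two. Starting at address 1 instead, both span two lines, so the gap seen in the aligned measurement disappears once the code is not aligned.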

However, given the huge limitations and problems I mentioned before, here's what I feel may be the best implementation of an inlined PRE/POST version of PROFILE (but use it at your own "risk"):


CPUID ; CPUID has different execution times the first two times it gets executed.
MOV EAX,[.CYCLES] ; let's warm at least that part of the cache
FWAIT
CPUID ; serialize before reading the counter (CPUID overwrites EAX, EBX, ECX and EDX)
RDTSC
MOV [.CYCLES],EAX ; save the start count (low 32 bits)
CPUID

.. your code to be tested goes here ..

FWAIT
CPUID
RDTSC
SUB EAX,[.CYCLES] ; EAX = end count - start count
MOV [.CYCLES],EAX ; keep a copy of the result in memory

.. the (32 bit) number of cycles it took to execute your code to be tested is now in EAX (and in [.CYCLES]) .. remember though that it still has quite a high degree of randomness, even if reduced, so use PROFILE on a subroutine whenever you can.
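One detail of working with only the low 32 bits of the counter: the subtraction still gives the right answer if the counter wraps between the two readings, as long as the measured interval is shorter than 2^32 cycles. A tiny C sketch of that arithmetic (the helper name `elapsed32` is made up for illustration):

```c
#include <stdint.h>

/* Elapsed count from two 32-bit counter samples.
   Unsigned subtraction wraps modulo 2^32, so a single wrap is harmless. */
static uint32_t elapsed32(uint32_t start, uint32_t end) {
    return (uint32_t)(end - start);
}
```

For example, a start reading of 0xFFFFFFF0 and an end reading of 0x00000010 give an elapsed count of 0x20.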
Posted on 2002-08-23 04:39:07 by Maverick