I think I saw someone say in a previous post that manually moving values onto the stack is faster than using the push instruction on most modern CPUs, i.e.:

mov [esp-4],10
sub esp,4

as opposed to push 10

I think it kind of makes sense when you have a lot of pushes to do (like when making function calls and such), because PIIIs (and I think PIIs) can process more than one single-uop instruction at a time but only one instruction of 2+ uops at a time (and I think push has two).

I was wondering if someone can confirm this.
I know it's probably a redundant optimization, but I'm getting into the habit of pushing my own parameters when calling a function instead of using invoke, just so I can pass a value returned in eax to more than one function without saving that value in another register or in memory.
Posted on 2002-06-12 13:56:12 by Satrukaan
Satrukaan, yes, that is correct: when doing multiple PUSH/POPs, it is faster to MOV the values before/after updating the stack pointer manually. Each PUSH/POP both depends on ESP and changes ESP - this is what slows things down. Also, you can keep the stack better aligned when you are doing it all yourself, but it is a slower coding process.
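As a rough sketch of what that looks like for a call with three DWORD parameters (MASM syntax; SomeProc and the values are made up for illustration, and I'm assuming a stdcall routine that removes its own parameters):

```asm
; instead of:
;   push 30
;   push 20
;   push 10
;   call SomeProc

    sub  esp, 12                  ; reserve all three slots with a single ESP update
    mov  dword ptr [esp],   10    ; first parameter (lowest address)
    mov  dword ptr [esp+4], 20    ; second parameter
    mov  dword ptr [esp+8], 30    ; third parameter
    call SomeProc                 ; hypothetical stdcall routine, pops its 12 bytes itself
```

The three stores only read ESP, so none of them has to wait for a previous instruction to change it.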
Posted on 2002-06-12 14:31:14 by bitRAKE
Thanks for the confirmation, bitRAKE.
Do you know if this optimization is exclusive to Intel processors, or does it apply to AMD processors too?

I know this will slow down programming somewhat, and I'll probably drop it once I actually start writing a lot of code. But it's good to know.
Posted on 2002-06-12 15:23:36 by Satrukaan
Hi Satrukaan,
just one warning that bitRAKE forgot to give you:

You used:


mov [esp-4],10
sub esp,4


This means that you store the value before you have reserved the stack space for it. I'm not sure if that's safe under Windows, and I know that it is dangerous in most older OSes.

If any system routine, using your stack, is activated between the two instructions, then your data is lost. The chance might seem slight, but sooner or later it would happen.
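To stay on the safe side, reserve the slot first and store afterwards, something like this (same value as in your example; the dword ptr is there because MASM needs an operand size for an immediate store to memory):

```asm
    sub  esp, 4                ; reserve the stack slot first
    mov  dword ptr [esp], 10   ; then store into space that is already yours
```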
Posted on 2002-06-12 16:37:13 by RAdlanor

Originally posted by Satrukaan
Thanks for the confirmation, bitRAKE.
Do you know if this optimization is exclusive to Intel processors, or does it apply to AMD processors too?

I know this will slow down programming somewhat, and I'll probably drop it once I actually start writing a lot of code. But it's good to know.
Yes, this applies to the AMD chips as well, and there are sections in the optimization guide regarding this.

Also, I like to use the C calling convention for this same reason, and it allows me to design compatible interfaces, layering the parameters on the stack. This way the stack pointer doesn't fluctuate so greatly when routines need the same values, or similar values. Routines can make their changes on the stack and leave the values there - eliminating a level of indirection and all that pointer passing crap. Sure, if you change a big structure you don't want to pass it on the stack, but if it's local to the parent code, it is already on the stack. Much thought has to go into the design from the start.
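A rough sketch of what I mean, assuming two cdecl routines that take the same two DWORD parameters (FirstProc, SecondProc and the value 100 are made-up placeholders, and eax is assumed to already hold the value you want to pass):

```asm
    sub  esp, 8                   ; reserve both parameter slots once
    mov  dword ptr [esp],   eax   ; first parameter
    mov  dword ptr [esp+4], 100   ; second parameter (arbitrary example value)
    call FirstProc                ; cdecl: the callee leaves the parameters on the stack
    call SecondProc               ; sees the same slots, including any changes FirstProc made
    add  esp, 8                   ; the caller removes the parameters once, at the end
```

With stdcall each routine would pop its own parameters on return, so the second call would need them pushed all over again.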
Posted on 2002-06-12 16:58:27 by bitRAKE
What about when dealing with memory locations and/or variables?

How can this:


mov eax, memLoc/variable
sub esp, 4
mov [esp], eax


be faster than this:


push memLoc/variable


Even if the first piece of code is quicker, it is far more prone to errors. While those errors may not matter much to you when you are just cutting some code in your spare time for the fun of it, it does become a pain in the *** if you ever release something, especially if it is a commercial release :) While Satrukaan did acknowledge that the optimisation was 'unnecessary', I would still be reluctant to put the first bit of code above into even a time-critical algo. There would be very few cases where you just have to save those two or three clock cycles.
Posted on 2002-06-12 21:52:41 by sluggy
sluggy, I won't make a case for a single push/pop. ;)
Posted on 2002-06-12 21:56:48 by bitRAKE
sluggy, I won't make a case for a single push/pop.
I know what you mean... I was just hoping to point out that optimisations like that are great to know about, but their uses are usually academic only, and in the above case it was only useful when putting immediate values on the stack.
Posted on 2002-06-13 00:04:29 by sluggy
Originally posted by sluggy
I know what you mean... I was just hoping to point out that optimisations like that are great to know about, but their uses are usually academic only, and in the above case it was only useful when putting immediate values on the stack.
I don't agree with the "usually academic only" part. Yes, the above example isn't a very good one. But the memory pointed to by ESP is very important during program flow, and access to that memory can be optimised. Just because the tools don't exist to make it easy doesn't mean it's academic. Programs can be designed to ensure ESP is always aligned with little or no overhead - I am not talking about Intel's documented methods in their manual; that kind of overhead is senseless in all but the most remote situations. The speed increase through better cache utilization and reduced call overhead is real for procedures that are called many times. I would like to suggest a whole-program approach using an aligned calling convention, which can be realized with custom EPILOGUE/PROLOGUE macros. Sometimes, if you have a procedure that is being called millions of times, you can inline the code, but I'm speaking more toward code on an interface boundary.
Posted on 2002-06-13 00:27:09 by bitRAKE