I think I saw in a previous post someone say that manually moving values relative to the stack pointer is faster than using the PUSH instruction on most modern CPUs, i.e.:
mov [esp-4],10
sub esp,4
as opposed to push 10
I think it kind of makes sense when you have a lot of pushes to do (like making function calls and such), because PIIIs (and I think PIIs) can process more than one 1-uop instruction at a time, but only one instruction of 2+ uops at a time (and I think PUSH has two).
I was wondering if someone can confirm this.
I know, it's probably a redundant optimization, but I'm getting into the habit of pushing my own parameters when calling a function, instead of using INVOKE, just so I can pass a value returned in EAX to more than one function without saving that value in another register or memory.
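For what it's worth, here is a small sketch of the idea for multiple arguments: one SUB reserves all the slots, and plain MOVs fill them, so the stores don't serialize on the implicit ESP update the way consecutive PUSHes do. The procedure name and argument values are made up for illustration:

```asm
; hypothetical: passing three DWORD arguments without PUSH.
; a single ESP update replaces three implicit ones.
sub  esp, 12                  ; reserve space for all three args
mov  dword ptr [esp+8], 30    ; third argument
mov  dword ptr [esp+4], 20    ; second argument
mov  dword ptr [esp], 10      ; first argument
call SomeProc                 ; SomeProc is a made-up name
```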
Satrukaan, yes, that is correct when doing multiple PUSHes/POPs: it is faster to MOV after/before updating the stack manually. PUSH/POP also has a dependency on ESP and changes ESP - this is what slows things down. Also, you can keep the stack better aligned when you do it all yourself, but it is a slower coding process.
Thanks for the confirmation, bitRAKE.
Do you know if this optimization is exclusive to Intel processors, or do AMD processors have it too?
I know this will slow down programming somewhat and I'll probably drop it once I actually start writing a lot of code. But it's good to know.
Hi Satrukaan,
just one warning that bitRAKE forgot to give you:
You used:
mov [esp-4],10
sub esp,4
This means that you store the value before you have reserved the stack space for it. I'm not sure whether that's safe under Windows, and I know that it is dangerous in most older OSes.
If any system routine that uses your stack is activated between the two instructions, your data is lost. The chance might seem slight, but sooner or later it would happen.
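To be explicit about the safe ordering (a minimal sketch of the fix being suggested):

```asm
; reserve the stack space FIRST, then store into it.
; anything that borrows the stack between the two instructions
; now pushes below the reserved slot instead of on top of it.
sub  esp, 4
mov  dword ptr [esp], 10
```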
Also, I like to use the C calling convention for this same reason, and it allows me to design compatible interfaces, layering the parameters on the stack. This way the stack value doesn't fluctuate so greatly when routines need the same values, or similar values. Routines can make the changes on the stack and leave the values there - eliminating a level of indirection and all that pointer-passing crap. Sure, if you change a big structure you don't want to pass it on the stack, but if it's local to the parent code, it is already on the stack. Much thought has to go into the design from the start.
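As a sketch of that reuse (hypothetical procedure names; assumes both follow the C convention, where the caller removes the arguments):

```asm
; two cdecl calls sharing the same argument slots.
; because a cdecl callee doesn't pop its arguments,
; they are still on the stack for the second call.
sub  esp, 8
mov  dword ptr [esp], eax      ; first arg: a value returned in EAX
mov  dword ptr [esp+4], 100    ; second arg (made-up value)
call FirstProc                 ; FirstProc/SecondProc are made-up names
call SecondProc                ; reuses the same two arguments
add  esp, 8                    ; caller cleans up once, at the end
```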
What about when dealing with memory locations and/or variables?
How can this:
mov eax, memLoc/variable
sub esp, 4
mov [esp], eax
be faster than this:
push memLoc/variable
Even if the first piece of code is quicker, it is far more prone to errors. Those errors may not matter much to you when you are just cutting some code in your spare time for the fun of it, but they become a pain in the *** if you ever release something, especially if it is a commercial release :) While Satrukaan did acknowledge that the optimisation was 'unnecessary', I would still be reluctant to put the first bit of code above into even a time-critical algo; there would be very few cases where you just have to save those two or three clock cycles.
sluggy, I won't make a case for a single push/pop. ;)
I know what you mean... I was just hoping to point out that optimisations like that are great to know about, but their uses are usually academic only, and in the above case it was only useful when putting immediate values on the stack.
I don't agree with the "academic only" part. Yes, the above example isn't a very good one. The memory pointed to by ESP is very important during program flow, and access to that memory can be optimised. Just because the tools don't exist to make it easy doesn't mean it's academic. Programs can be designed that ensure ESP is always aligned with little or no overhead - I am not talking about Intel's documented methods in their manual; that kind of overhead is senseless in all but the most remote situations. The speed increase through better cache utilization and reduced call overhead is real for procedures that are called many times. I would like to suggest a whole-program approach using an aligned calling convention, which can be realized with custom EPILOGUE/PROLOGUE macros. Sometimes, if you have a procedure that is being called millions of times, you can inline the code, but I'm speaking more toward code on an interface boundary.
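A rough sketch of what such custom macros might look like (the macro names and the 16-byte target are assumptions for illustration, not a tested implementation):

```asm
; hypothetical MASM-style prologue/epilogue keeping ESP aligned
ALIGNED_ENTER MACRO
    push ebp
    mov  ebp, esp
    and  esp, -16        ; round ESP down to a 16-byte boundary
ENDM

ALIGNED_LEAVE MACRO
    mov  esp, ebp        ; discard locals and the alignment pad
    pop  ebp
    ret
ENDM
```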