Well, let's assume we have the following piece of code:

mov eax, [ebx+ecx*4]

In this case, ecx is playing the role of an index pointer. It is being multiplied by 4 because it is dealing with DWORD data size.
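In Python terms (a sketch with hypothetical addresses, just to illustrate the scaling):

```python
# Sketch: why the index is scaled by 4 for DWORD (4-byte) elements.
# The effective address of element i in a DWORD array is base + i*4,
# which is exactly what the [base + index*4] addressing mode encodes.
def effective_address(base, index, element_size=4):
    """Mimic the [base + index*scale] addressing mode (illustrative only)."""
    return base + index * element_size

# Element 3 of a DWORD array starting at (hypothetical) address 0x1000:
print(hex(effective_address(0x1000, 3)))  # 0x100c
```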

-My question is: where is the multiplication by 4 actually performed? Does it happen in the same micro-operation block as the MUL instruction?
-If so, how is multiplication using the LEA instruction faster than the MUL instruction if they are both performed in the same place? For example:

;this multiplication
mov eax, 15
mov ebx, 10
mul bx

;same as
mov eax, 15
lea eax, [eax+eax*4]
add eax, eax
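Assuming the operand stripped from the LEA above was [eax+eax*4] (the forum appears to have eaten the square brackets), the equivalence can be checked numerically in Python:

```python
# mul bx: multiplies AX by BX; with these values the result 150 fits in AX.
eax = 15
ebx = 10
mul_result = eax * ebx           # 150

# The LEA/ADD version, assuming the operand was [eax+eax*4]:
eax = 15
eax = eax + eax * 4              # lea eax, [eax+eax*4] -> eax*5
eax = eax + eax                  # add eax, eax         -> eax*10
assert eax == mul_result == 150
```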
Posted on 2011-01-22 00:06:12 by banzemanga
No, it's not really a multiply-operation anyway. It can only do factors 1, 2, 4 and 8. Since these are all powers of two, a simple bit shift is enough to perform them.
This is not performed by the ALU itself, but by a special address generation unit (AGU).
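In other words, the scale is a two-bit field that selects a shift amount, not a general multiplier. A Python sketch of the address the AGU computes (the function name and interface are illustrative, not any real API):

```python
def agu_address(base, index, scale, disp=0):
    """Effective address: base + (index << log2(scale)) + disp.
    scale must be 1, 2, 4 or 8 -- a 2-bit field in the instruction,
    so a simple shift covers every encodable value."""
    shift = {1: 0, 2: 1, 4: 2, 8: 3}[scale]  # only powers of two exist
    return base + (index << shift) + disp

# Same as [0x1000 + ecx*4] with ecx = 3 (hypothetical values):
print(agu_address(0x1000, 3, 4))  # 4108, i.e. 0x100C
```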
Posted on 2011-01-22 03:12:17 by Scali
Cool. That is new to me. I didn't know it was only powers of 2, and not done by the multiplication unit.

I wonder what is faster:

lea eax, [eax+eax*4]
shl eax, 1

;or this
mov ecx, eax
shl eax, 1
shl ecx, 3
add eax, ecx
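Assuming the missing LEA operand was [eax+eax*4], both variants compute eax*10; a quick Python check:

```python
def variant_lea(x):
    x = x + x * 4      # lea eax, [eax+eax*4] (assumed operand) -> x*5
    return x << 1      # shl eax, 1                             -> x*10

def variant_shifts(x):
    ecx = x            # mov ecx, eax
    x <<= 1            # shl eax, 1 -> x*2
    ecx <<= 3          # shl ecx, 3 -> x*8
    return x + ecx     # add eax, ecx -> x*2 + x*8 = x*10

for v in (0, 1, 15, 123456):
    assert variant_lea(v) == variant_shifts(v) == v * 10
```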

-One takes fewer instructions but does not allow out-of-order execution.
-The other uses one more register and is bigger, but has a higher chance of out-of-order execution.
Posted on 2011-01-22 12:02:15 by banzemanga
That depends a lot on the specific microarchitecture.
Different operations have different latencies and different throughputs for various operations.
Some CPUs will break up certain variations of the lea instruction into two consecutive operations, others can process them in one go.
Since the LEA is executed by the AGU rather than the ALU, some CPUs require an extra cycle to forward the operands between AGU/ALU.

Then there's the shift, which may not always be as fast as you might think. The Pentium 4 architecture for example, took 2-4 cycles for a single shift.
Therefore, an add eax, eax is always preferred over a shl eax, 1. Depending on the situation, a lea eax, [eax+eax] may be preferred over a regular add.

I would say that the best solution is probably:
lea eax, [eax+eax*4]
lea eax, [eax+eax]
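Assuming the operands eaten by the forum were [eax+eax*4] and [eax+eax], this computes x*5 and then doubles it; a Python check:

```python
def mul10_two_lea(x):
    x = x + x * 4      # lea eax, [eax+eax*4] (assumed operand) -> x*5
    x = x + x          # lea eax, [eax+eax]   (assumed operand) -> x*10
    return x

assert mul10_two_lea(15) == 150
```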

If you look at your second bit:
mov ecx, eax ; This can execute right away
shl eax, 1  ; This can execute in parallel with the mov
shl ecx, 3  ; This has to wait until the mov is completed
add eax, ecx ; This has to wait for both shl's to complete

So you have created a dependency chain of three instructions here: mov -> shl -> add.
This will take at least 3 cycles.
The other routine has only two dependent instructions, and would take at least 2 cycles. So you could afford to get an extra cycle penalty between AGU/ALU. Combine that with the fact that it only needs one register, and I think this is the preferred form by far.
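The latency argument can be sketched as a toy critical-path model in Python (assuming one cycle per operation and unlimited parallel execution units, which is a simplification):

```python
# Toy model: each instruction is (name, [names it depends on]); an
# instruction completes one cycle after all of its inputs are ready.
def critical_path(instrs):
    done = {}
    for name, deps in instrs:
        done[name] = max((done[d] for d in deps), default=0) + 1
    return max(done.values())

# mov/shl/shl/add version: mov -> shl ecx -> add is a 3-long chain.
shifts = [("mov", []), ("shl_eax", []), ("shl_ecx", ["mov"]),
          ("add", ["shl_eax", "shl_ecx"])]
# two-LEA version: lea -> lea is only a 2-long chain.
leas = [("lea1", []), ("lea2", ["lea1"])]

print(critical_path(shifts), critical_path(leas))  # 3 2
```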
Posted on 2011-01-22 12:46:39 by Scali
Thanks, and awesome, Scali. Where did you learn about those? Is there any book in specific I could read to learn about instructions and architecture insights?
Posted on 2011-01-22 17:33:18 by banzemanga
Well, you can find quite a bit of info on microarchitectures in the Intel Optimization Manual. AMD has a similar manual, but it is not as detailed about their architecture.
You could also look at Agner Fog's optimization resources. He put together some very detailed microarchitecture information for all the popular CPUs.
Posted on 2011-01-23 10:04:17 by Scali