Hi, I was just wondering whether I will induce any extra CPU usage by using something like this:



mov ebx,[edi+ebx*4]


Since it involves a multiplication, it may put some extra load on the CPU.
Posted on 2003-02-06 06:59:43 by x86asm
The (valid) scale factors are all powers of 2, so they are implemented as a shift.
It's not any real effort for the processor to do a shift.

Mirno
Posted on 2003-02-06 08:25:59 by Mirno
IIRC this computation (shift and addition) takes only one clock cycle on a Pentium CPU (I read that somewhere in Agner Fog's Pentium optimisation guide).
Posted on 2003-02-06 09:32:24 by Tola
I'm not sure about the costs of using Base+Index*Scale in the context of a MOV instruction. When used with a LEA instruction, however, the scaling factor is likely to slow things down on Pentium 4. See Intel's Optimization Manual 24896607.pdf, p. 2-57:
The lea instruction is not always as fast on the Pentium 4 processor as it is on the Pentium II and Pentium III processors. This is primarily due to the fact that the lea instruction can produce a shift µop. If the lea instruction uses a shift by a constant amount then the latency of the sequence of µops is shorter if adds are used instead of a shift, and the lea instruction is replaced with the appropriate sequence of µops.

Regards, Frank
Posted on 2003-02-06 12:30:26 by Frank
x86,

What you have here ,

mov ebx, [edi+ebx*4]

is the base address EDI plus the index EBX which is multiplied by the scaling factor 4.

To compare it to other techniques for accessing a memory address you would compare how many single instructions would be needed to do the same thing.

At its worst you would put the base address in a temporary register, separately multiply the index by the scaling factor and then add it to the base address, but this would be far slower than using the standard Intel complex addressing mode.
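As a rough sketch (the choice of EAX as the temporary register is only for illustration), that worst case looks something like this,

mov eax, ebx        ; copy the index
shl eax, 2          ; multiply it by the scaling factor 4
add eax, edi        ; add the base address
mov ebx, [eax]      ; finally read the memory location

versus the single instruction above that does all of that work by itself.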

The Intel data shows that the extra operations in calculating the effective address take extra time in comparison to a simple address like,

mov ebx, [edi]

but the calculations are far faster than doing the same with a set of extra instructions.

Regards,

hutch@movsd.com
Posted on 2003-02-06 18:24:33 by hutch--
Thanks for your input guys!!
Posted on 2003-02-06 19:05:23 by x86asm

The Intel data shows that the extra operations in calculating the effective address take extra time in comparison to a simple address like,

mov ebx, [edi]

but the calculations are far faster than doing the same with a set of extra instructions.

What they actually compare is more like

lea ebx, [edi+ebx]      ; no scaling factor present
lea ebx, [edi+ebx*4]    ; scaling factor present

Compared to the first line, the second line suffers a time penalty so big that using a sequence of ADDs instead would be faster. Or, to say the same in Intel's words:
Assembly/Compiler Coding Rule 42. (ML impact, M generality) If an lea instruction which uses the scaled index is on the critical path, the sequence with the adds may be better, but if code density and bandwidth out of the trace cache are the critical factor, then the lea instruction should be used.
(Source: Pentium 4 Optimization Manual, available here; p. 2-58)
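As a sketch of what that rule means in practice (my own example, not taken from the manual, and note that it clobbers ECX), a scaled-index LEA on the critical path could be replaced like this:

lea eax, [ebx+ecx*4]    ; scaled index, so a shift µop is produced

; add-only replacement producing the same result in EAX
add ecx, ecx            ; ecx * 2
add ecx, ecx            ; ecx * 4
lea eax, [ebx+ecx]      ; no scaled index, so no shift µop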

For a related view see here (scroll far far down, to the section called MISTAKE #6).

Of course, this holds only for Pentium 4 processors, not for Pentium 3 or below. And it is about LEA, not about MOV. On the other hand, I can't see why scaling a register should slow down just the LEA instruction, but not the MOV instruction ...

Regards, Frank
Posted on 2003-02-06 21:00:29 by Frank

The Pentium 4 is ugly.. personally I prefer++ the K7 and soon the K8.
Posted on 2003-02-07 03:40:29 by Maverick


For a related view see here (scroll far far down, to the section called MISTAKE #6).



Dang. That page is good.

I did a research paper for our English class about RISC/CISC and I would have LOVED to have read this at the time...
Posted on 2003-02-07 03:56:58 by AmkG
I am much more inclined to trust my own benchmarking between the PIII 600 and the 1.5 GHz PIV that I have, and the performance difference is by no means trivial.

For a 2.5 times increase in clock speed, I get benchmark times that are between 4 and 5 times faster on the PIV, so I am inclined to take such horror stories with a grain of salt at best.

I have also found that the PIV is easier to optimise for than the PIII which was always fussy about what worked well on it.

Frank,

I am aware of the odd difference between the PIII and PIV, and it shows up in areas like ADD REG, 1 versus INC REG. There are a few others, but clocking them is the best way to try them out. I have never found a CMOV## to be faster than a conditional jump, and that holds across the PIII and PIV. LEA seems to be a special case with the architecture change; it used to be a trick instruction from the 486 up, but it seems Intel have given it less silicon this time, the endless shift to multimedia functions coming at the expense of the conventional integer types.

Regards,

hutch@movsd.com
Posted on 2003-02-07 07:29:21 by hutch--
hutch, I would like to know how you benchmarked the difference between a conditional jump plus mov, versus a cmov.
If the benchmark is repeated thousands of times (as is usual for small-code benchmarking), it does not accurately reflect the predictability of jumps: the more you repeat a loop, the better the prediction of the jumps becomes. I think cmov should make a difference when the jump becomes unpredictable, as it won't need to flush the instruction pipeline after a wrong prediction (in fact cmovs don't predict at all, as far as I'm aware).
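To make the comparison concrete, here is a minimal sketch (my own example, a max-of-two-registers kernel) of the two forms in question,

; branch version: cheap when the branch predicts well, costly when it doesn't
cmp eax, edx
jge @F
mov eax, edx
@@:

; cmov version: no branch, so nothing to mispredict
cmp eax, edx
cmovl eax, edx

Both leave the larger of EAX and EDX in EAX.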

Mirno
Posted on 2003-02-07 07:55:08 by Mirno
Mirno,

I have found in every instance that with loop code executed at very high rates of repetition, CMOV## has never improved on a CMP/J##, so even though it allows you to remove an unpredictable jump, it has never clocked any faster.

When I was working on the Boyer Moore algos some time ago, there was an important instance where a particular conditional jump, reached on every iteration, could be replaced by a CMOV##, but every variation I wrote was slower. BM algos are a mess of unpredictable jumps, so any reduction in the loop jump count should reduce the problem, but the CMOV## instructions in fact did not help, so I ended up abandoning the instruction and staying with the CMP/J## combination.

Outside of highly repetitious loop code I do not bother much as it does not really matter.

Regards,

hutch@movsd.com
Posted on 2003-02-07 17:14:15 by hutch--
hutch--, you are right, some clocking is required. I have timed both LEA and MOV when used either with or without a scaled index register. Initial results differed slightly depending on the order of the test procedures in the source code. Therefore, I have created a separate executable for each of my four test procedures. See the attached ZIP file (source + executables + test results).

The procedures repeatedly either read a given memory location, or compute that location's effective address. The index register EBX is either scaled (EBX = 1, scaling factor = 8), or it is not scaled (EBX = 8, scaling factor = absent). The output shows the execution time for each of four independent runs of a test procedure. The output also shows the average of the four runs. Times are in milliseconds.
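The core of the scaled MOV procedure looks roughly like this (just a sketch here; the exact code is in the attached ZIP),

    mov ebx, 1             ; index; the scaling factor supplies the * 8
    mov ecx, 10000000h     ; number of iterations
  @@:
    mov eax, [esi+ebx*8]   ; ESI points at the test data; unrolled by four
    mov eax, [esi+ebx*8]
    mov eax, [esi+ebx*8]
    mov eax, [esi+ebx*8]
    sub ecx, 1
    jnz @B

The LEA procedures compute the effective address instead of reading memory, and the unscaled versions use EBX = 8 with no scaling factor.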

On a Pentium 3 (850 MHz), scaling versus not scaling EBX made no difference for either MOV or LEA:

----------------------------------------
Timing LEA EAX, [ESI+EBX] (lea_1.asm, 52)
time = 1280 (lea_1.asm, 65)
time = 1279 (lea_1.asm, 65)
time = 1273 (lea_1.asm, 65)
time = 1267 (lea_1.asm, 65)
average = 1274 (lea_1.asm, 74)
----------------------------------------
Timing LEA EAX, [ESI+EBX*8] (lea_2.asm, 52)
time = 1275 (lea_2.asm, 65)
time = 1274 (lea_2.asm, 65)
time = 1268 (lea_2.asm, 65)
time = 1274 (lea_2.asm, 65)
average = 1272 (lea_2.asm, 74)
----------------------------------------
Timing MOV EAX, [ESI+EBX] (mov_1.asm, 52)
time = 1285 (mov_1.asm, 65)
time = 1273 (mov_1.asm, 65)
time = 1272 (mov_1.asm, 65)
time = 1271 (mov_1.asm, 65)
average = 1275 (mov_1.asm, 74)
----------------------------------------
Timing MOV EAX, [ESI+EBX*8] (mov_2.asm, 52)
time = 1285 (mov_2.asm, 65)
time = 1277 (mov_2.asm, 65)
time = 1275 (mov_2.asm, 65)
time = 1274 (mov_2.asm, 65)
average = 1277 (mov_2.asm, 74)


On a Pentium 4 (1700 MHz), the presence of a scaling factor made a huge difference for LEA, but not for MOV:

----------------------------------------
Timing LEA EAX, [ESI+EBX] (lea_1.asm, 52)
time = 320 (lea_1.asm, 65)
time = 321 (lea_1.asm, 65)
time = 320 (lea_1.asm, 65)
time = 321 (lea_1.asm, 65)
average = 320 (lea_1.asm, 74)
----------------------------------------
Timing LEA EAX, [ESI+EBX*8] (lea_2.asm, 52)
time = 681 (lea_2.asm, 65)
time = 671 (lea_2.asm, 65)
time = 651 (lea_2.asm, 65)
time = 661 (lea_2.asm, 65)
average = 666 (lea_2.asm, 74)
----------------------------------------
Timing MOV EAX, [ESI+EBX] (mov_1.asm, 52)
time = 641 (mov_1.asm, 65)
time = 631 (mov_1.asm, 65)
time = 641 (mov_1.asm, 65)
time = 631 (mov_1.asm, 65)
average = 636 (mov_1.asm, 74)
----------------------------------------
Timing MOV EAX, [ESI+EBX*8] (mov_2.asm, 52)
time = 651 (mov_2.asm, 65)
time = 641 (mov_2.asm, 65)
time = 641 (mov_2.asm, 65)
time = 631 (mov_2.asm, 65)
average = 641 (mov_2.asm, 74)

Apparently, introducing a scaling factor doubled the execution time of the LEA instruction. I suspect that the same happens with the MOV instruction -- however, because the MOV instruction is so much slower to begin with, the effect may simply go unnoticed here. At present I have no idea how one could demonstrate such an effect.

One could look at these execution times from a different angle. My P4 has twice the clock speed of my P3. Accordingly, three (of four) procedures ran twice as fast on the P4 as on the P3. The exception was the LEA without scaling factor, which ran four times as fast. That would be a real improvement in the processor, beyond the greater clock speed.

Of course someone should reproduce these results on their P4 machine before we can take them for granted.

Regards, Frank
Posted on 2003-02-09 16:13:58 by Frank

I suspect that the same happens with the MOV instruction -- however,
because the MOV instruction is so much slower to begin with, the effect may simply go unnoticed here.

Then how would you explain that in your timing table:
lea EAX, [ESI+EBX*8]
is slower than
mov EAX, [ESI+EBX*8]
?
Posted on 2003-02-09 16:41:24 by The Svin
Then how would you explain that in your timing table:
lea EAX, [ESI+EBX*8]
is slower than
mov EAX, [ESI+EBX*8]
?


The "lea EAX," procedure took 666ms on average, whereas the "mov EAX," procedure took only 641ms. With procedures running for that long, we cannot meaningfully interpret the absolute values. But we can interpret the time difference between these procedures.

Time difference = 666ms - 641ms = 25ms, which is less than 4% of the larger value.

The processor was clocked at 1.7 GHz
= 1,700 MHz
= 1,700,000 KHz
= 1,700,000,000 Hz
= 1,700,000,000 clocks per second
= 1,700,000 clocks per millisecond.

The time difference amounts to 1,700,000 clocks/ms * 25ms = 42,500,000 clocks.

Hardcoded number of loops (iterations) per procedure = 10,000,000h loops = 268,435,456d loops.

Unexplained clocks per loop = 42,500,000 clocks / 268,435,456 loops = 0.16 clocks per loop. So we have roughly one unexplained clock every six loops.

The loops had been unrolled by four (see the source code). Thus, after every 24 LEA instructions, the processor took an extra cycle that was not observed with the MOV instruction. An interesting, but tiny effect. Maybe the processor can distribute the MOV µops more evenly across its execution units than the LEA µops? Maybe LEA, but not MOV, competes occasionally with the loop instructions (SUB ECX, 1 / JNZ @B) for execution port 0? Sorry, I have no definitive answer.

Regards, Frank
Posted on 2003-02-10 10:00:09 by Frank

I suspect that the same happens with the MOV instruction -- however,
because the MOV instruction is so much slower to begin with,
the effect may simply go unnoticed here.


Then if the lea instruction is generally faster than mov,
and yet using a scale > 1 makes it slower than mov
with a SIB using a scale > 1, we can also suppose that
even if there is a slowing-down effect of using a scale > 1 with mov,
the effect is much less than with lea.
That was my point in the first post.
Posted on 2003-02-11 13:27:13 by The Svin


The Pentium 4 is ugly.. personally I prefer++ the K7 and soon the K8.



ya I have a K7 CPU at home also, but how would that impact an AMD CPU?
Posted on 2003-02-13 10:41:16 by x86asm
Then if the lea instruction is generally faster than mov,
and yet using a scale > 1 makes it slower than mov
with a SIB using a scale > 1, we can also suppose that
even if there is a slowing-down effect of using a scale > 1 with mov,
the effect is much less than with lea.
That was my point in the first post.


Our definitions of "the effect" seem to differ, see below. Let me apologize in advance for a lengthy post. I did try to express it better, but without success. Here we go.

In my view, LEA without shift is fast because two of them can be handled simultaneously in the double-speed ports 0 and 1, within a single processor cycle. => Throughput = 2 instructions/cycle.

Introducing a shift slows down LEA because the shift ties up double-speed Port 1 for a full cycle. => Throughput = 1 instruction/cycle.
That is what I consider "the effect" -- the shift µop fully consumes Port 1.

MOV is slow even without a shift because it uses a single-speed port (Port 2, memory read). => Throughput = 1 instruction/cycle.

Like LEA with shift, MOV with shift consumes Port 1 for a full cycle. However, that does not hurt here because it happens in parallel with the single-speed memory read in Port 2. => Throughput = 1 instruction/cycle.
So the same effect occurs for the MOV as for the LEA, but simply goes unnoticed.

You correctly point out that in my data, LEA with shift was even slower than MOV with shift. My explanation for the phenomenon is this: in both cases, the shift goes to Port 1, and the remaining two µops (computing the sum of ESI and EBX; moving "something" into EAX) go to the other ports. For MOV, these µops go to different ports (Port 0 and Port 2); for LEA, they go to the same port (Port 0). For LEA, double-speed Port 0 is now busy: it can handle two µops per cycle, and it does get two µops per cycle. For MOV, double-speed Port 0 is not busy: it can handle two µops per cycle, but it gets only one.

Now consider that a loop in my code consisted of either four LEA or four MOV instructions. For each iteration of a LEA loop, that results in 4 * 0 = 0 empty "slots" in Port 0. For each iteration of a MOV loop, that results in 4 * 1 = 4 empty "slots" in Port 0. The loop overhead "SUB ECX, 1 / JNZ @B" should fit nicely into the empty Port 0 slots of a MOV loop. For the LEA loop, there are no empty slots. Therefore the LEA loop took longer.
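Annotating one instruction of each loop with the port assignments I have in mind (my own shorthand, not Intel's notation),

lea eax, [esi+ebx*8]   ; sum -> Port 0, result move -> Port 0, shift -> Port 1
mov eax, [esi+ebx*8]   ; sum -> Port 0, load -> Port 2, shift -> Port 1

so in the MOV loop half of the double-speed Port 0 slots stay free for SUB ECX, 1 / JNZ @B, while in the LEA loop they do not.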

I would not consider the additional delay from the loop overhead to be part of "the effect of scaling a register". But of course, your definition may differ.

Regards, Frank
Posted on 2003-02-14 14:48:03 by Frank




ya I have a K7 CPU at home also, but how would that impact an AMD CPU?

Sorry, didn't notice your post until now:
On the K7, the only particular limitation of the LEA instruction arises if you use the $66 prefix... i.e. (in the Win32 case) if you use the 16 bit form (useful in some tricky optimizations) it gets slow, because it's a VectorPath instruction (i.e. not RISC-like).
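For example, just to illustrate the 16 bit form I mean (the registers are arbitrary),

lea eax, [edx+ecx*2]   ; the usual 32 bit form
lea ax, [edx+ecx*2]    ; 16 bit destination, assembled with a 66h operand-size prefix;
                       ; only the low 16 bits of the computed address end up in AX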

I really fail to comprehend why the Pentium 4 designers introduced flaws like the one in the (very widely used) 32 bit form of the LEA instruction... or in the whole FPU. How many transistors do you need to hardwire the 1, 2, 4, 8 shifts? Almost none. The K7 is also extremely well optimized for CISC-like memory operations using complex addressing modes.
I'm not ashamed to say that I'm not an Intel fan.
Posted on 2003-02-14 15:50:39 by Maverick

Let me apologize in advance for a lengthy post.

:)
I love detailed posts.
So thank you for the detailed clarification of your point.
Posted on 2003-02-17 17:39:35 by The Svin