While going through some stuff on my webserver, I came across this old article that I wrote, trying to explain some of the basics of CPU pipelining and out-of-order execution.
It's quite old, written back in the early days of OoO, with the Pentium Pro... But there haven't really been major changes in the execution model of CPUs since the Pentium Pro, so it may still be quite useful to people today. So I thought I'd post it here.
======

An execution pipeline is basically a conveyor belt for instructions.
Instructions pass through various stages; at each stage, a part of the execution of
the instruction is done. Normally each stage takes 1 clock cycle (clk), that is,
an instruction stays in a stage for 1 clk, and then proceeds to the next.
Basically there are 3 main operations:

- Fetch: Load instruction from memory.
- Decode: Set up the operands (load from memory if necessary), and find out what
the instruction is supposed to do. Basically, prepare to dispatch the operands to
the correct logic unit for execution.
- Execute: Take the operands, dispatch them to the correct logic unit, perform the
desired operation, and store the result.

A simplified pipeline could have these three stages: fetch - decode - execute.
The instruction passes through them in that sequence:


        Fetch -> Decode -> Execute
       +--------+--------+--------+
clk 0: | INSTR  |        |        |
       +--------+--------+--------+

       +--------+--------+--------+
clk 1: |        | INSTR  |        |
       +--------+--------+--------+

       +--------+--------+--------+
clk 2: |        |        | INSTR  |
       +--------+--------+--------+


Now if the instruction is being fetched, it cannot be decoded or executed yet.
Does this mean that only 1 stage of the pipeline is active, while the rest is
idle?
The answer is no. Why not? Because while you execute the first instruction,
you can decode the second, and fetch the third at the same time. The different
stages do not share any hardware resources, so each stage can work
independently of the others, at the same time:

        Fetch -> Decode -> Execute
       +--------+--------+--------+
clk 0: | INSTR0 |        |        |
       +--------+--------+--------+

       +--------+--------+--------+
clk 1: | INSTR1 | INSTR0 |        |
       +--------+--------+--------+

       +--------+--------+--------+
clk 2: | INSTR2 | INSTR1 | INSTR0 |
       +--------+--------+--------+


So the stages can work in parallel on sequential instructions. This in effect
reduces the execution time of a sequence of instructions. If you have, for
example, a piece of code of 10 instructions, it doesn't take 10 x 3 = 30 clks.
Instead, it takes 2 clks for the first instruction to reach the execute stage,
and from then on a new instruction arrives at the execute stage every clk, so
the 10 instructions execute in 10 clks. That's a total of only 2 + 10 = 12 clks.
There is a sort of 'telescoping', or 'waterfall' effect.
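
As a rule of thumb, for a pipeline with S stages and a sequence of N
independent instructions, the total time is:

    (S - 1) + N clks

where the (S - 1) clks are the time the first instruction needs to reach the
last stage. For S = 3 and N = 10 that gives the 2 + 10 = 12 clks from the
example above.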

The different instructions in the pipeline can be dependent on each other however.
It could be that the instruction being decoded requires an operand that is the
result of the instruction being executed. To solve this problem, operand
forwarding was introduced. This means that the decode stage does not have to
wait for the result to become available. The result is forwarded back into the
pipeline as a new operand, so execution can continue without "stalling"
(waiting):

Consider an add instruction of the form: add destination, source1, source2

INSTR0: add r0, r1, r2
INSTR1: add r4, r0, r3  <- dependency: r0 is modified by the previous instruction

        Fetch -> Decode -> Execute
       +--------+--------+--------+
clk 0: | INSTR2 | INSTR1 | INSTR0 |->-+
       +--------+--------+--------+   |
                                      |
                         +-----<------+ r0
                         |
       +--------+--------+--------+
clk 1: | INSTR3 | INSTR2 | INSTR1 |
       +--------+--------+--------+


Another problem arises when a jump is made. When you jump to a different
location, the instructions that were already in the pipeline are now invalid.
So the pipeline needs to be flushed, and fetching has to start over at the
right location.
With unconditional jumps, modern CPUs will spot the jump in time, and continue
fetching instructions at the new location right away, without stalling.
With conditional jumps however, this is not possible. It is not clear whether
the jump should be taken or not until the condition has been tested, which in
the worst case happens in the instruction right before the conditional jump.
So instead of waiting, a modern CPU will try to predict whether the jump will
be taken or not, judging from a hint in the code (the usual rule is: if the
jump is to a lower address, it is probably a loop, so take the jump; else, do
not take it. Some CPUs even have hint bits inside the opcode to indicate the
expected behaviour), and from previous iterations: the CPU stores whether the
jump was taken or not on the last few occasions. With this strategy, the code
can usually continue without stalling. Only when the prediction turns out to
be wrong does the pipeline have to be flushed, and execution stalls for a few
clks until the first instruction from the correct location reaches the
execution stage.
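
To sketch what this means in practice (the label names and the loop count here
are made up for illustration): a loop's backward branch fits the prediction
rule, and a rarely-taken case is best placed behind a forward branch, so the
common path falls through:

    mov ecx, 100        ; some loop count
    loop_top:
    add eax, [esi]
    add esi, 4
    dec ecx
    jnz loop_top        ; backward jump: predicted taken, as befits a loop

    cmp eax, 0
    je rare_case        ; forward jump: predicted not taken, so the common
                        ; path continues without a flush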

A superscalar CPU is a CPU which has more than one execution pipeline working
in parallel, so it can execute more than one instruction per clk. Operand
forwarding can no longer solve all dependency problems, however. It could be
that 2 instructions are being executed in parallel, but the second one requires
the result of the first one. The CPU will then have to put the second pipeline
on hold. It can only execute the first instruction, then forward the result
into the second pipeline. Then the second pipeline can continue executing on
the next clk.
We refer to this as a stall in the second pipeline. The penalty is 1 clk.
It is the responsibility of the programmer to avoid these stalls.
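
For example, in x86-style notation (the registers are arbitrary; assume a CPU
with 2 pipelines like the Pentium discussed below):

    add eax, ebx
    add ecx, eax        ; needs eax: the second pipeline stalls for 1 clk

    add eax, ebx
    add esi, edi        ; independent: executes in the second pipeline
    add ecx, eax        ; eax has been forwarded by now: no stall

By moving an independent instruction between the two dependent ones, both
pipelines get useful work every clk.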

What happens when the clockspeed goes up?
All stages in the pipeline consist of sequences of logic gates. A gate has a
certain propagation time: the time it takes to settle to a consistent state
after a change of input signals. So the sequence of gates in a pipeline stage
can have a total propagation time of at most 1 clk. As clockspeed goes up, the
time that 1 clk takes goes down, so the maximum length of the logic sequence
in a pipeline stage decreases as well.
There are 2 solutions to this problem:
- Reduce the logic required to fetch, decode and execute instructions
- Split the pipeline up in more stages

In order to reduce the logic required to fetch and decode instructions, the
instructions should be encoded in a less complex fashion.
Reduced Instruction Set Computing (RISC) is a CPU design philosophy which aims
for small and simple instruction sets. Complex instructions which can be
replaced by a sequence of one or more simple instructions, without
significantly increasing total execution time, are not implemented in hardware.
Instructions should also be encoded in an orthogonal (uniform) manner, to
reduce decoding complexity: they should all have the same size and layout.
The old paradox of "less is more" is a nice description of RISC.
To give a name to the previous, more complex designs, the name Complex
Instruction Set Computing (CISC) was chosen. These more complex designs date
from an earlier age, when memory was small and expensive, and clock speeds
were low. There was no pipelining yet, let alone superscalar designs. Code was
mostly written by hand rather than by compilers. So the designs were made to
minimize the code footprint, and to maximize the amount of work done per
instruction. This meant a lot of instructions with implicit operands, and
instructions that perform common sequences of operations.

A few results of this philosophy are that virtually all instructions can only
operate on registers, and virtually all instructions have the same number of
operands.
Loading registers from memory, or storing them to memory, is done with a few
dedicated instructions.
By the time RISC was implemented, manufacturers were able to put millions of
gates on a chip, but because of RISC, far fewer were needed than before. The
extra gates could now be used for more registers, extra cache or other
features.

Splitting the pipeline up in order to increase clockspeed is a poor man's
solution: if there are interlocks because of dependent instructions, for
example, the penalties can increase from 1 clk to several clks. But what if
you cannot redesign your instruction set, because you want to keep supporting
legacy code?

There are again 2 solutions:
- Design a new RISC CPU, and support legacy code with a software emulation
layer.
- Design a new RISC CPU, and support legacy code with a hardware emulation
layer.

It is interesting to see that Motorola chose the first option when going from
the 68k to the PowerPC.
Apple's computers were powered by 68k CPUs; the new PowerPC-based machines had
68k emulation built into the OS. At the time, the choice was not such a bad
one. The new RISC CPU could be manufactured at much higher clockspeeds, and
with more cache and more registers. This gave a leap in performance, which
could make emulated legacy code perform acceptably, while native code would
run at the highest possible speed.

Motorola did however release a 68060, which was designed with the second
option in mind. That was also the choice that Intel made for their x86 family:
the instruction set was not redesigned completely, but they did use some ideas
taken from the RISC philosophy. In short, their CPUs now translate the complex
x86 instructions internally, and then issue them to an execution pipeline
which could be seen as a sort of RISC CPU. Since the Pentium PRO, this
pipeline executes what Intel calls "micro-operations" or µOps (Pentium PRO,
Celeron, PII and PIII all have the same P6 core). And sure enough, some of the
instructions which would have been removed in a RISC instruction set are now
constructed from a sequence of 2 or more µOps. The Pentium has no name for the
'atomic operations' that it builds the instructions from... but it clearly
does just that.
Their pipeline is still longer than that of a true RISC CPU, but it is shorter
than it would be if all instructions were decoded and executed directly,
rather than translated to µOps.
Intel managed to keep the implementation of a complex instruction set within
reasonable bounds, and managed to get the clockspeed up to impressive rates.
At the time of writing they have a Pentium 4 CPU out, which runs at 1.5 GHz.
This makes the Pentium 4 the highest-clocked CPU currently available.
The high clockspeed comes at a cost, however. The instruction throughput is
relatively low, because the pipeline has a lot of stages: at 1.5 GHz, the
stages have to be very short. On top of that, its decoding scheme is still
more complex than that of CPUs with a RISC instruction set.

But let's look at the x86 family more closely. Because at the backend there is
the RISC-like µOp execution unit, the CPU also gets RISC-like traits. What
does this mean?
The translation from x86 instructions to µOps is not always an easy one. Only
a part of the x86 instruction set can be considered orthogonal, and x86
instructions can have complex addressing modes. Intel decided to map
register-only operations 1:1 to µOps, and add additional µOps for the complex
addressing modes. The Pentium PRO can handle simple addressing modes in 1-µOp
forms as well.

A few Pentium examples:

    add eax, edx


is 1 clk.

    add eax, [mem]

 
will translate to:

    mov tmp, [mem]
    add eax, tmp

 
where tmp is a temporary internal register. Takes 2 clks.

    add [mem], edx

 
is three operations on Pentium:

    mov tmp, [mem]
    add tmp, edx
    mov [mem], tmp

 
This is exactly the style in which RISC code is written, since add would only
be able to operate on registers, not on memory directly.
So in short, the CPU seems to translate ordinary x86 code to RISC code.
The downside to having the CPU translate your code is that the pipeline will
have to wait for the translated sequence of operations to finish, before
issuing the next pair of instructions.
When you write the RISC-like code yourself, you are working with 1-clk
operations only, and you can get 2 independent instructions through per clk.
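
For example (a sketch; [mem1] and [mem2] stand for two independent memory
operands that don't conflict in the data cache), two read-modify-write
operations like the add above can be written out and interleaved, so that both
pipelines get a 1-clk instruction every clk:

    mov eax, [mem1]     ; the two loads pair: 1 clk
    mov ebx, [mem2]
    add eax, edx        ; the two adds pair: 1 clk
    add ebx, ecx
    mov [mem1], eax     ; the two stores pair: 1 clk
    mov [mem2], ebx

That is 3 clks for two memory adds, while a single add [mem], edx already
takes 3 clks on its own.
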
A few instructions are even more interesting on Pentium, because translating
them by hand will actually make them faster.

For example:

    cdq

 
takes 3 clks.

    mov edx, eax
    sar edx, 31

 
takes 2 clks.

    stosd

 
takes 2 clks.

    mov [edi], eax
    add edi, 4

 
takes 1 clk.

    loop label


takes 5 clks.

    dec ecx
    jnz label

 
takes 1 clk.

Basically there are a few extra rules for instructions to execute in 1 clk on
the Pentium.
The Pentium has 2 pipelines, but they are not identical: the primary pipeline
is called U, the secondary pipeline is called V. Some instructions can only be
executed in U (e.g. shift/rotate instructions), others only in V (e.g.
call/jump instructions).
There are also some instructions which will not pair at all: they will only
execute in U, and V will stall.
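
A small example of these pairing rules (the surrounding instructions are
arbitrary): a shift can pair, but only if it ends up in U:

    shl eax, 4          ; only allowed in U
    add ebx, ecx        ; independent and allowed in V: the pair takes 1 clk

    add ebx, ecx        ; goes into U
    shl eax, 4          ; not allowed in V, so it waits for the next clk in U
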
The decoders also have limits. Each pipeline can decode instructions of up to
7 bytes per clk; if a longer instruction is encountered, there will be a stall.
Finally, the Address Generation Unit (AGU) sits one stage earlier in the
pipeline than the execute stage. This means that if an instruction has to
address memory with the result of a previous instruction, it will have to
stall for 1 clk: the operand has to be forwarded not one, but 2 stages back.
This is known as the Address Generation Interlock (AGI) stall:

    add eax, ebx
    mov edx, [eax]

        Decode ->  AGU -> Execute
       +--------+--------+--------+
clk 0: | INSTR2 | INSTR1 | INSTR0 |->-+
       +--------+--------+--------+   |
                                      |
                +----------<----------+ eax
                |
       +--------+--------+--------+
clk 1: | INSTR2 | INSTR1 |        |
       +--------+--------+--------+

We say the latency of the instruction is 2 clks: the result has to be
available 2 clks before the instruction that uses it executes. Most
instructions have a latency of 1 clk. Operand forwarding can avoid stalls on
1-clk latency instructions, as long as the two dependent instructions are not
issued in the same clk, as we've seen earlier.
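
A sketch of how to work around the AGI stall (the filler instructions are
arbitrary, as long as they are independent): put a full clk of unrelated work
between computing the register and using it as an address:

    add eax, ebx        ; pairs with the next instruction (clk 0)
    add esi, 4
    add edi, 4          ; pairs with the next instruction (clk 1)
    sub ecx, 1
    mov edx, [eax]      ; clk 2: eax was not modified in the preceding clk,
                        ; so there is no AGI stall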

The Pentium PRO (and the other P6 core CPUs) is a completely different beast
however. Instead of translating in place, as the Pentium does, the Pentium PRO
decodes the x86 instructions into µOps and stores them in a buffer. A
scheduling unit (the Reservation Station) will then dispatch µOps to the
execution units as the operands and the unit become available. This means it
can execute instructions out-of-order (ooo). The results are then stored in a
reorder buffer (ROB), so the result is guaranteed to be equivalent to that of
the original sequence of instructions. In other words, the out-of-order
execution is transparent to the program.
Why is this interesting? A stall occurring in the early part of the pipeline
will not affect the backend. The decoder might not issue new µOps during the
stall, but while there are still µOps in the ooo buffer, the backend can
continue execution. The ooo execution can also minimize the cost of latencies
and stalls: while 1 µOp has to wait for a previous one to complete, the CPU
can still dispatch other µOps which are not dependent on it, even if they were
originally sequenced after the dependent instruction.
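
A small sketch of what that buys you (the registers are arbitrary):

    mul ecx             ; long-latency operation
    mov ebx, eax        ; depends on the mul, has to wait for it
    add esi, 4          ; independent: the ooo backend can dispatch this
    add edi, 4          ; and this, while the mul is still in flight

On an in-order CPU, the two adds would simply have to wait their turn.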

The Pentium PRO has 3 decoders: d0, d1 and d2. Decoder d0 can decode one
instruction per clk, with a maximum length of 7 bytes and a maximum complexity
of 4 µOps. Decoders d1 and d2 can each decode an additional instruction of
1 µOp per clk.
So the ooo buffer can be filled with up to 6 µOps per clk.

The ooo execution pipeline has 5 different ports, ports 0 through 4. Each port
handles a specific class of µOps.
Port 0 handles integer and floating point arithmetic, and address generation
operations.
Port 1 handles simple integer arithmetic instructions (not shift, multiply or
divide).
Port 2 handles memory load operations.
Port 3 handles the calculation of memory write addresses.
Port 4 handles memory write operations.

Each port can take 1 µOp per clk. This means that the CPU can dispatch a
maximum of 5 (independent) µOps per clk. Since the decoders decode into the
ooo buffer, they are no longer dependent on the execution backend. This means
that the decoders no longer have to stall when execution takes more than
1 clk. In fact, the ports on a PPRO are themselves pipelined as well, which
means that even multiple-clk instructions such as multiply can be issued
every clk, and they pass through the stages one after another. With the PPRO,
you don't specify the execution speed of instructions by simply counting
clks; instead, you specify latency and throughput.
For example, mul is specified as: latency 4 clks, throughput 1 per clk.
This means that when we issue a mul, it takes 4 clks to complete in total,
but you can issue 1 mul per clk. If you issue 2 muls back to back, the
second one will follow one clk after the first. So we see the same
'telescoping' or 'waterfall' effect...
1 mul takes 4 clks
2 muls take 5 clks
3 muls take 6 clks
etc.
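
For example, using the two-operand imul form so that the multiplies are
independent of each other (the registers are arbitrary, and we assume this
form has the same latency 4 / throughput 1 per clk as the mul above):

    imul eax, ebp       ; starts in clk 0, completes 4 clks later
    imul ecx, ebx       ; starts in clk 1, completes 1 clk after the first
    imul esi, edi       ; starts in clk 2, completes 1 clk after the second

All three are done after 6 clks, as in the list above.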

After execution, the µOps wait in the ROB for "retirement". The retire stage
will update all operands (register or memory) in order, at a maximum of 3 µOps
per clk. It will also take care of the operand forwarding.

This rather complex and very interesting execution scheme shifts the focus
from scheduling instructions to keeping the ooo backend fed with µOps.
Scheduling instructions around dependencies is now less important, because the
CPU can schedule the µOps itself. There should not be so many dependencies
that the ooo buffer fills up faster than the backend can drain it, but not
every dependency directly results in a stall anymore. The ooo buffer acts like
a cushion for dependencies and latencies.

You could see the CPU as a funnel for instructions... You can add up to 6
µOps per clk, then it can dispatch up to 5 µOps per clk, and finally retire up
to 3 µOps per clk.
When writing code for PPRO CPUs, the main focus should be on using as few
µOps as possible, rather than on getting as many µOps per clk into the ooo
buffer as possible. Decoding them efficiently is important as well:
instructions that consist of more than 1 µOp should be scheduled to go into
d0, and try to get both d1 and d2 to decode an instruction as well.
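
A sketch of such an ordering ([mem1] is just a placeholder memory operand):

    add [mem1], eax     ; read-modify-write: multiple µOps, should go to d0
    add ebx, ecx        ; 1 µOp: can be decoded by d1 in the same clk
    add esi, edi        ; 1 µOp: can be decoded by d2 in the same clk

If the multi-µOp instruction came second or third instead, it would have to
wait for d0 to become free in the next clk.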

One part we've not touched on yet is the PPRO's register renaming scheme.
While on the outside it appears that the PPRO has only 8 integer registers, it
in fact has 40 temporary registers inside. It uses these temporary registers
internally for special kinds of dependencies, such as this one:
    mov eax, [mem1]
    mul [mem2]
    mov [mem3], eax
    mov eax, [mem4]


We saw earlier that a mul takes 4 clks. If the CPU used the physical
registers directly, then the last mov would have to wait until the previous
store operation was dispatched. Instead, the first three instructions get one
temporary register assigned for eax, and the last instruction gets a new
temporary register assigned. The eax register has been 'renamed' to a
temporary register. Now the final load operation can be dispatched at any
time; it does not have to wait for the previous instructions to finish. The
CPU can rename 3 registers per clk. Each temporary register has its own
dependency chain. Try to keep these chains short, so that the CPU can allocate
new temporary registers early, which reduces dependencies and allows the CPU
to dispatch more independent µOps.
(Note: the old tricks of 'xor reg, reg' or 'sub reg, reg' to set a register
to 0 are not recognized as independent. A new independent chain will not
be created.)
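
A sketch of keeping dependency chains short ([mem1] through [mem4] are
placeholder memory operands): summing four values into one register forms a
single long chain, while using two registers gives the renamer and the ooo
backend two shorter, independent chains:

    ; one long chain: every add depends on the previous one
    mov eax, [mem1]
    add eax, [mem2]
    add eax, [mem3]
    add eax, [mem4]

    ; two shorter chains, combined at the end
    mov eax, [mem1]
    mov ebx, [mem2]
    add eax, [mem3]
    add ebx, [mem4]
    add eax, ebx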