After reading the Intel document on optimisation, I was wondering about partial register stalls. It says:


Special cases of reading and writing small and large register pairs are implemented in
Pentium Pro and Pentium II processors in order to simplify the blending of code across
processor generations. The special cases are implemented for XOR and SUB when using
EAX, EBX, ECX, EDX, EBP, ESP, EDI and ESI as shown in the following examples:
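
The manual's examples are not reproduced here. As a rough illustration of the kind of sequence the passage seems to describe (a sketch, not the manual's own listing; mem8 is a hypothetical byte variable):

```asm
; Illustration only, not taken from the Intel manual.
; mem8 is a hypothetical byte-sized variable.
xor eax,eax              ; special case: the upper bits of eax are now known to be zero
mov al,BYTE PTR [mem8]   ; small write into the freshly cleared register
add ebx,eax              ; large read of eax after the small write, covered by the special case
```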


So is it better to change the following code



@@:
mov ax,WORD PTR[ebx+ecx]
mov BYTE PTR[edi+edx],al
test al,al
jz @F
inc edx
add ecx,2
jmp @B
@@:


to


@@:
xor eax,eax
mov ax,WORD PTR[ebx+ecx]
mov BYTE PTR[edi+edx],al
test al,al
jz @F
inc edx
add ecx,2
jmp @B
@@:
Posted on 2003-04-09 09:31:29 by roticv
A partial register stall means that when you change a partial register (for example ah, bx or cl), the whole register is assumed to have been changed by the instruction, for the purposes of the pairing optimisations done by the CPU. If the documentation says that special cases are coded in to avoid this assumption, it probably means that, for example, changing al with one of the instructions you mentioned does not make the processor think that ah has been changed.

I think the top two instructions of the code you gave don't pair anyway, since the second one depends on the first, so the second alternative probably isn't faster. By the way, all discussion about pairing applies only to the first generation of Pentium processors (up to and including the Pentium MMX).

What xor is good for is to break dependency chains, but this only works in newer processors. In case you don't know, breaking a dependency chain means making the processor aware that from that point on, a register's content doesn't depend on what it was before.



Err, I think the info above might not be completely accurate. From Agner Fog's optimization manual:

"Partial register stall is a problem that occurs when you write to part of a 32 bit register and later read from the whole register or a bigger part of it."

Read chapter 19 from this:

http://www.agner.org/assem/pentopth.zip
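
Going by that definition, what matters is whether the later read is wider than the earlier write. A minimal sketch, reusing the registers from the loop in the first post (esi as an accumulator is an assumption made only for illustration):

```asm
; Read of a smaller part after a partial write: no stall by the quoted definition
mov ax,WORD PTR [ebx+ecx]   ; write 16 bits
mov BYTE PTR [edi+edx],al   ; read 8 bits, a smaller part of what was written
test al,al                  ; read 8 bits again

; Read of a bigger part after a partial write: the stall case
mov ax,WORD PTR [ebx+ecx]   ; write 16 bits
add esi,eax                 ; read 32 bits (esi is hypothetical): stall on PPro/PII/PIII
```

On that reading, the original loop only ever reads al after writing ax, so it would not hit the stall the definition describes.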

Posted on 2003-04-09 09:54:06 by Knightmare
Also isn't "movzx eax, " better than "xor eax,eax / mov ax, " on modern processors?
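
For reference, the loop from the first post with movzx substituted might look like the sketch below. The registers are the ones used in the original; as far as I know movzx writes the whole of eax in one instruction on the P6 core, so a later full-register read would not rely on the xor special case.

```asm
@@:
movzx eax,WORD PTR [ebx+ecx]  ; load 16 bits and zero the upper half of eax in one go
mov BYTE PTR [edi+edx],al     ; store the low byte
test al,al                    ; stop once the low byte is zero
jz @F
inc edx
add ecx,2
jmp @B
@@:
```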
Posted on 2003-04-10 02:22:16 by f0dder

Sure it is.
Posted on 2003-04-10 02:32:54 by Maverick
Why is it better?

xor is pairable in the U or V pipe
mov is pairable in the U or V pipe

movzx is not pairable

Intel - Appendix A integer pairing tables.

I hope this document is correct for modern processors...
Posted on 2003-05-04 02:05:20 by V Coder
AFAIK, it is always better to work with the full register (32-bit) on latest Intels.
Posted on 2003-05-04 02:11:03 by comrade
V Coder, partial register stalls do not happen on the plain Pentium or its MMX variant. They use actual registers, which are written to directly by the instruction.

However, on the P6 core (Pentium Pro, Pentium II, and Pentium III), the problem of partial register stalls arrived. It is almost certainly a problem with the Pentium 4 core too, as it uses a similar mechanism, although it has not been characterised yet.
The problem occurs because, in order to get the throughput of the several separate execution engines, they have a bank of mappable registers. So eax does not exist in the same sense as it did on older processors; it is a pointer to a part of this register block. In fact there are pointers for EVERY register in this block, including ax, ah, and al. It takes up to 7 clocks to unify the separate registers. So even though al is a sub-part of eax, it will take up to 7 clocks for the processor to sort out the pointers.

The stall occurs because the processor assumes that all is right with the world, and that the pointers are correct. It is not until slightly later on in the pipeline that the processor realises "oops this isn't the correct data", at which point it must flush the pipeline, re-load it with the stalled instruction at the top of the re-loaded stream, and continue. This process will take a lot longer than 7 clocks! So if there is a partial register access directly (or within 7 clocks) behind a load/modification of a register, then a partial register stall will be hit.


In the loop from the first post, adding the xor will not solve anything, though!
It effectively does:
#1 Put the register in a known zero state
#2 Spoil the lower half of the register for 7 clocks
#3 Use the bad low quarter of the register, and stall
#4 Test the now-unified low quarter of the register....



@@:
mov al,BYTE PTR[ebx+ecx+1] ; Not sure about the + 1!
mov BYTE PTR[edi+edx],al
test al,al
jz @F
inc edx
add ecx,2
jmp @B
@@:
mov ax,WORD PTR[ebx+ecx] ; after the loop: reload the full word, only if ax is still needed


Mirno
Posted on 2003-05-04 04:48:53 by Mirno
Exact same results:
_0:	mov		al, BYTE PTR [ebx + ecx] ; no +1 needed

add ecx, 2

mov BYTE PTR [edi + edx], al
inc edx

test al, al
jnz _0

; delete each instruction for values not needed...
sub ecx, 2
dec edx
mov ah, BYTE PTR [ebx + ecx + 1]
Posted on 2003-05-04 09:14:34 by bitRAKE