This is actually multiple questions in a single thread but they are all related.

I read somewhere that for simple arithmetic instructions, the smaller the register size the faster it is. For example:
{sub al, 48} is faster than {sub eax, 48}
Of course, that is assuming that numbers in the arithmetic operations are always small enough not to cause carries or overflows. I wonder if that is still true in modern processors.

What about movzx vs mov? If XOR is used only once to clear the register, and MOV is then used in the loop afterwards, is there any optimization gain?

Let's assume the result has to be in a 32-bit size but the input is an 8-bit size:

mov esi, offset string
xor eax, eax
xor ebx, ebx
again:
add eax, ebx
movzx ebx,byte ptr
sub ebx, 48
jnb again


Replaced with the alternative:

mov esi, offset string
xor eax, eax
xor ebx, ebx
again:
add eax, ebx
mov bl, 
sub bl, 48
jnb again

Is there going to be any gain in optimization?
Posted on 2011-01-26 16:20:39 by banzemanga

I read somewhere that for simple arithmetic instructions, the smaller the register size the faster it is. For example:
{sub al, 48} is faster than {sub eax, 48}
Of course, that is assuming that numbers in the arithmetic operations are always small enough not to cause carries or overflows. I wonder if that is still true in modern processors.


It's not true.
In fact, using 'partial registers' (al/ah/ax etc, rather than the full eax, ebx etc) can be suboptimal. Namely, the CPU only supports full registers internally.
If you use al, it will use a full 32-bit register internally... That's all fine and well... But if you modify al, and then read ax or eax, it has to combine the new value of al from the 32-bit register with the rest of the register. This causes a so-called 'partial register stall'.
An exception is implemented: the CPU will know that it does not have to combine the registers, if the register was 0 beforehand. For example:
xor eax, eax ; CPU will set an internal flag to signal that eax is 0
mov al, 8 ; CPU will allocate an internal register for al
add eax, 10 ; CPU will know that eax was 0 beforehand, so it skips the recombining stage and just uses the internal 'al' register as eax, and performs the add immediately
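To make the "combine" step concrete, here is a rough C model of what a write to al means for the full eax. It only models the architectural result, not the timing or the renaming hardware, and the names write_al and demo_partial_write are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: architectural effect of writing AL, i.e. replace
   the low 8 bits of EAX and keep the upper 24 bits as they were. */
static uint32_t write_al(uint32_t eax, uint8_t al)
{
    return (eax & 0xFFFFFF00u) | al;
}

/* Walks through the three-instruction example from the post. */
static void demo_partial_write(void)
{
    uint32_t eax = 0;            /* xor eax, eax */
    eax = write_al(eax, 8);      /* mov al, 8    */
    eax += 10;                   /* add eax, 10  */
    assert(eax == 18);

    /* If the upper bits are not known to be zero, the merge matters:
       the old upper 24 bits must survive the write to AL. */
    assert(write_al(0x12345600u, 8) == 0x12345608u);
}
```

When the upper 24 bits are known to be zero, the merged value is simply the new al, which is why the zeroed case can skip the recombining work.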


What about movzx vs mov? If XOR is used only once to clear the register, and MOV is then used in the loop afterwards, is there any optimization gain?


On some CPUs movzx may be slightly slower. It all depends. On the other hand, movzx is a good way to avoid the partial register stall described above.


Let's assume the result has to be in a 32-bit size but the input is an 8-bit size:

mov esi, offset string
xor eax, eax
xor ebx, ebx
again:
add eax, ebx
movzx ebx,byte ptr
sub ebx, 48
jnb again


Replaced with the alternative:

mov esi, offset string
xor eax, eax
xor ebx, ebx
again:
add eax, ebx
mov bl, 
sub bl, 48
jnb again

Is there going to be any gain in optimization?


The xor ebx, ebx at the top should avoid partial register stalls on the add eax, ebx. The second version will be slightly faster on CPUs which perform movzx slower than mov.
You could have a look here: http://www.agner.org/optimize/instruction_tables.pdf
It would seem that on AMD K7, K8 and K10 (different architectures? yea right) for example, movzx r,m has higher latency than mov r,m.
Other than that, only the ancient Pentium 1/MMX seems to be slower with movzx (then again, those don't have the partial register problem, since they don't do out-of-order execution and register renaming). All the modern Intel architectures (Pentium 4, Core2, Core i7, even Atom) have the same performance for mov and movzx.
You seem to have forgotten a lea esi, somewhere in the loop, though :)
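For what it's worth, here is a rough C model of the two loop bodies, just to show that they leave the same sum in eax. It says nothing about speed. Two things are assumptions on my part: the load reads from [esi] (the memory operand is missing from the posted code), and the pointer advances by one byte per iteration. The function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Variant 1: movzx ebx, byte ptr [esi] / sub ebx, 48 / jnb again */
static uint32_t sum_digits_movzx(const char *s)
{
    const uint8_t *esi = (const uint8_t *)s;
    uint32_t eax = 0, ebx = 0;
    for (;;) {
        eax += ebx;               /* add eax, ebx                    */
        ebx = *esi++;             /* movzx: whole register rewritten */
        if (ebx < 48)             /* sub ebx, 48 would borrow,       */
            break;                /* so jnb falls through            */
        ebx -= 48;
    }
    return eax;
}

/* Variant 2: mov bl, [esi] / sub bl, 48 / jnb again */
static uint32_t sum_digits_mov_bl(const char *s)
{
    const uint8_t *esi = (const uint8_t *)s;
    uint32_t eax = 0, ebx = 0;    /* xor ebx, ebx once, before the loop */
    for (;;) {
        eax += ebx;               /* add eax, ebx */
        uint8_t bl = *esi++;      /* mov bl: only the low byte is loaded */
        if (bl < 48)              /* sub bl, 48 would borrow */
            break;
        bl = (uint8_t)(bl - 48);
        ebx = (ebx & 0xFFFFFF00u) | bl;  /* upper 24 bits stay zero */
    }
    return eax;
}

/* Both variants should agree for any zero-terminated input. */
static void check_variants_agree(const char *s)
{
    assert(sum_digits_movzx(s) == sum_digits_mov_bl(s));
}
```

Calling check_variants_agree("2011") passes, and both functions return 4 for that input. Note that the exit test only catches bytes below '0', such as the terminating zero, so anything above '9' is still added to the sum.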
Posted on 2011-01-27 02:01:08 by Scali
I was under the impression that the internal register size for most modern PC processors is 80 bits, and for GPUs it's 48 bits, but otherwise, I agree with everything :)
Posted on 2011-01-27 02:26:19 by Homer

80 bits? Only for the FPU.
GP registers are 64 bits these days, but since this was 32-bit code, I decided not to over-complicate the explanation. In 32-bit mode, 64-bit CPUs work exactly the same as 32-bit CPUs.
Obviously SSE registers are 128 bits, and AVX is 256 bits.

No idea where you get 48 bits for a GPU... Latest generation GPUs have 64-bit double precision IEEE754.
Posted on 2011-01-27 02:57:45 by Scali
48 bit color space was used internally by some Adobe products, I read that this was because it was the 'native gpu color resolution' at the time of writing, and accepted that at face value.

Posted on 2011-01-27 05:30:51 by Homer

But colour resolution is something completely different from GPU registers. Generally the GPU's internal registers and ALUs have higher precision than the rendertarget.
Besides, 48 bit colour space will probably mean either 12:12:12:12 or 16:16:16. So that would be 12 or 16 bits of precision per component.
A modern GPU with double precision has 64 bits per component, of which 53 bits are mantissa precision (52 stored bits plus an implicit leading bit).
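The numbers above are easy to sanity-check in C. This is only a sketch: bits_per_component and check_precision_claims are names made up here, and the mantissa check assumes the compiler's double is IEEE 754 binary64, which is the case on common desktop platforms but is not guaranteed by the C standard.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: bits available to each component when a pixel of
   total_bits is split evenly across `components` channels. */
static unsigned bits_per_component(unsigned total_bits, unsigned components)
{
    return total_bits / components;
}

static void check_precision_claims(void)
{
    assert(bits_per_component(48, 4) == 12);  /* 12:12:12:12 */
    assert(bits_per_component(48, 3) == 16);  /* 16:16:16    */

    /* Assumption: double is IEEE 754 binary64 on this platform. */
    assert(FLT_RADIX == 2);
    assert(DBL_MANT_DIG == 53);               /* 52 stored + 1 implicit */
}
```

Either way you split it, a 48-bit colour format has far less precision per component than a double.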
Posted on 2011-01-27 05:52:30 by Scali