What would be the fastest way to implement a C++ 'for' loop like this in assembly:
x0 can, in rare cases, be equal to x1, so the code block inside the loop is not executed. For the average case the number of iterations is small.
At first thought I would place the test for < x1 at the top, and at the bottom always jump back up unconditionally, like this:
But I'm not sure if this is optimal for branch prediction. And to leave the loop, it jumps twice, while with a test at the bottom it would fall through. So is maybe a test at the top (once) and at the bottom (to loop back) more optimal?
Thanks.
for(int x = x0; x < x1; x++)
{
// ...
}
x0 can, in rare cases, be equal to x1, so the code block inside the loop is not executed. For the average case the number of iterations is small.
At first thought I would place the test for < x1 at the top, and at the bottom always jump back up unconditionally, like this:
mov eax, x0
xloop:
cmp eax, x1
jge xout
// ...
add eax, 1
jmp xloop
xout:
But I'm not sure if this is optimal for branch prediction. And to leave the loop, it jumps twice, while with a test at the bottom it would fall through. So is maybe a test at the top (once) and at the bottom (to loop back) more optimal?
Thanks.
You could use a jmp before the loop :
It will save a jmp in the loop. I'm not sure but it could be bad for prefetching because it's a forward jump (if x1 is far away from x2 it should not be a problem)
jmp InLoop
xLoop:
; loop code
inc x0
InLoop:
cmp x0, x1
jne xLoop
It will save a jmp in the loop. I'm not sure but it could be bad for prefetching because it's a forward jump (if x1 is far away from x2 it should not be a problem)
jmp InLoop ; always predicted by CPU
ALIGN 8 ; yes!
xLoop:
; loop code
inc x0
InLoop:
cmp x0, x1
jne xLoop ; mis-predicted once on exit
Sometimes it would be better to use one register or memory location rather than two: going from x0 to x1, we know MAX(0,x1-x0) itterations are being performed.
Cases where it is better to use MAX(0,x1-x0):
A. no registers availible (use memory)
B. loop index used within the loop
BTW, awesome article in ShaderX2.
You are doing really great work!
Thanks for the information guys, exactly what I was looking for!
How significant would the align 8 be? I see that with this approach it can't have negative consequences... It is to make instruction fetch align at the start of a cache line, right? This year I'm getting an 'Advanced Computer Architecture' course at university and it's really showing me how to write even better assembly code.
BTW, awesome article in ShaderX2.
You are doing really great work!
Thanks! If you liked that one, try ShaderX3. 8) It should be available any week now. And I'm still working on new technology...
How significant would the align 8 be? I see that with this approach it can't have negative consequences... It is to make instruction fetch align at the start of a cache line, right? This year I'm getting an 'Advanced Computer Architecture' course at university and it's really showing me how to write even better assembly code.
BTW, awesome article in ShaderX2.
You are doing really great work!
Thanks! If you liked that one, try ShaderX3. 8) It should be available any week now. And I'm still working on new technology...
How significant would the align 8 be? I see that with this approach it can't have negative consequences... It is to make instruction fetch align at the start of a cache line, right?
The cache lines are 32/64 bytes, but the instructions are decoded in blocks of 8 or 16 bytes, iirc. Actually, I have had code run faster when not aligned. If there is no dependancy problems the goal should be to pack the most instructions into 8 byte blocks. It is a balancing act between all the stages of the pipeline. Also, this might effect how code is stored in the instruction cache - newer CPUs store the macro ops in the instruction cache to bypass decoding of cached instructions.
Funny thing is how instructions that are not even executed can slow down code! Intel advises this in their optimization manual - be careful what you use for multi-byte NOP and that no dependant instructions are around it.
Interesting suggestions, bitRAKE.
There is a rule about code alignment from Intel optimization reference manual:
There is a rule about code alignment from Intel optimization reference manual:
Assembly/Compiler Coding Rule 29. (M impact, H generality) All branch targets should be 16-byte aligned.