What would be the fastest way to implement a C++ 'for' loop like this in assembly:


for(int x = x0; x < x1; x++)
{
// ...
}

x0 can, in rare cases, be equal to x1, so the code block inside the loop is not executed. For the average case the number of iterations is small.

At first thought I would place the test for < x1 at the top, and at the bottom always jump back up unconditionally, like this:


mov eax, x0
xloop:
cmp eax, x1
jge xout
// ...
add eax, 1
jmp xloop
xout:

But I'm not sure if this is optimal for branch prediction. And to leave the loop, it jumps twice, while with a test at the bottom it would fall through. So is maybe a test at the top (once) and at the bottom (to loop back) more optimal?

Thanks.
Posted on 2004-10-27 10:32:38 by C0D1F1ED
You could use a jmp before the loop :



jmp InLoop
xLoop:
; loop code

inc x0
InLoop:
cmp x0, x1
jne xLoop


It will save a jmp in the loop. I'm not sure but it could be bad for prefetching because it's a forward jump (if x1 is far away from x2 it should not be a problem)
Posted on 2004-10-27 11:54:01 by Dr. Manhattan
    jmp InLoop ; always predicted by CPU


ALIGN 8 ; yes!

xLoop:
; loop code

inc x0
InLoop:
cmp x0, x1
jne xLoop ; mis-predicted once on exit
Sometimes it would be better to use one register or memory location rather than two: going from x0 to x1, we know MAX(0,x1-x0) itterations are being performed.

Cases where it is better to use MAX(0,x1-x0):
A. no registers availible (use memory)
B. loop index used within the loop


BTW, awesome article in ShaderX2.
You are doing really great work!
Posted on 2004-10-27 12:08:56 by bitRAKE
Thanks for the information guys, exactly what I was looking for!

How significant would the align 8 be? I see that with this approach it can't have negative consequences... It is to make instruction fetch align at the start of a cache line, right? This year I'm getting an 'Advanced Computer Architecture' course at university and it's really showing me how to write even better assembly code.

BTW, awesome article in ShaderX2.
You are doing really great work!

Thanks! If you liked that one, try ShaderX3. 8) It should be available any week now. And I'm still working on new technology...
Posted on 2004-10-27 18:16:03 by C0D1F1ED
How significant would the align 8 be? I see that with this approach it can't have negative consequences... It is to make instruction fetch align at the start of a cache line, right?
The cache lines are 32/64 bytes, but the instructions are decoded in blocks of 8 or 16 bytes, iirc. Actually, I have had code run faster when not aligned. If there is no dependancy problems the goal should be to pack the most instructions into 8 byte blocks. It is a balancing act between all the stages of the pipeline. Also, this might effect how code is stored in the instruction cache - newer CPUs store the macro ops in the instruction cache to bypass decoding of cached instructions.

Funny thing is how instructions that are not even executed can slow down code! Intel advises this in their optimization manual - be careful what you use for multi-byte NOP and that no dependant instructions are around it.
Posted on 2004-10-27 18:48:57 by bitRAKE
Interesting suggestions, bitRAKE.

There is a rule about code alignment from Intel optimization reference manual:
Assembly/Compiler Coding Rule 29. (M impact, H generality) All branch targets should be 16-byte aligned.
Posted on 2004-11-05 16:29:15 by MazeGen