Hello,

I'm building my raytracer app and using a modified 3d bresenham algo for traversing the ray.
this loop here is the main bottleneck as it is called very often:

for (i = 0; i < l; i++)
{
if (err_1 > 0) {
py += y_inc;
err_1 -= dx2;
}
if (err_2 > 0) {
pz += z_inc;
err_2 -= dx2;
}
err_1 += dy2;
err_2 += dz2;
px += x_inc;
}

The problem are the two branches, without them the code is ~6 times faster. I tried to eliminate the branch with a cmovg instruction, however the code became SLOWER (!!) than the one with normal jumps. (is there an explanation for that?)
So, do you know any trick which would eliminate that 2 branches? I was thinking about using the fact that the branch only compares with zero but didn't get any solution right now.

Sebastian
Posted on 2004-02-01 07:04:58 by lordseppo13
Sebastian,

I have never yet found an algo that goes faster with cmovXX than jXX. It may be worth you trying to rewrite the intensive code directly in assembler and try and make the branches work in their natural prediction direction which is backwards.

Regards,
http://www.asmcommunity.net/board/cryptmail.php?tauntspiders=in.your.face@nomail.for.you&id=2f46ed9f24413347f14439b64bdc03fd
Posted on 2004-02-01 07:37:10 by hutch--
I once worked on some encryption bruteforcing, lame custom algorithm. By replacing the conditional in "if (++j > 4) j = 0;" with a CMOV and doing loop unrolling, the algorithm ran significantly faster. I can't remember how much it was, but it was enough that I though "wow" :)
Posted on 2004-02-01 07:43:14 by f0dder
What is this for anyway? For stepping through an acceleration grid or something? I hope it's not for the ray/object intersection itself, that can be done much faster :)
Posted on 2004-02-01 08:10:12 by Henk-Jan
it's used for an uniform octree traversal. I don't have objects defined in my scene, instead voxels of size 1, which is actually a *real* 3d pixel.
So the octree itself is binary and I'm sending a ray through it, that's all.

Sebastian
Posted on 2004-02-01 08:22:03 by lordseppo13
I'm estimate a little but it definitely should be faster. Using cmovcc could be slightly faster if any. This is just a loop body.
	mov edx, y_inc

mov ecx, dx2
mov eax, err1
dec eax
shl eax, 1
sbb eax, eax
and edx, eax
and ecx, eax
add py, edx
sub err1, ecx

mov edx, z_inc
mov ecx, dx2
mov eax, err2
dec eax
shl eax, 1
sbb eax, eax
and edx, eax
and ecx, eax
add pz, edx
sub err2, ecx

mov eax, dy2
mov edx, dz2
mov ecx, x_inc
add err1, eax
add err2, edx
add px, ecx
Posted on 2004-02-01 10:13:13 by masquer
Generally cmov can be replaced by a series of cmp, sbb, and, not etc. Agner Fog, AMD and Intel optimization manuals are useful to learn how to avoid unpredictable branches.

This code is not tested :



mov i, L
@@:
xor eax, eax
cmp eax, err_1
sbb edx, edx ; edx = 0 < err_1 ? -1 : 0
cmp eax, err_2
sbb eax, eax ; eax = 0 < err_2 ? -1 : 0
mov ecx, edx ; ecx = 0 < err_1 ? -1 : 0
and edx, y_inc ; edx = 0 < err_1 ? y_inc : 0
and ecx, dx2 ; ecx = 0 < err_1 ? dx2 : 0
add py, edx
sub err_1, ecx
mov edx, eax
and eax, z_inc
and edx, dx2
add pz, eax
sub err_2, edx
add err_1, dy2
add err_2, dz2
add px, x_inc
dec I
jnz @B
Posted on 2004-02-01 12:27:27 by Dr. Manhattan
Thanks everyone for the answers...

the trick with the sbb is really good!

thanks!!

Sebastian
Posted on 2004-02-01 13:29:46 by lordseppo13