Hi,

As you probably know, the idiv instruction does a round towards zero. This also means the remainder can be negative. But for my specific application I need the remainder to be positive. In other words, I have to compute the flooring divide.

The straightforward C++ code that does this looks like this:


inline int floorDiv(int a, int b)
{
if(b == 0)
{
return 0x80000000; // Prevent exception
}

int q; // Quotient
int r; // Remainder

if(a >= 0)
{
q = a / b;
}
else
{
q = a / b;
r = -a % b;

if(r)
{
q--;
}
}

return q;
}

Which can be simplified to:


if(b == 0) return 0x80000000;
int q = a / b;
if(a < 0 && a % b) q--;
return q;

But I have a few problems optimizing it in assembly. First of all I would like to get rid of the divide by zero test, or implement it without a jump. The rest I would also like to implement with cmov only because my arguments are quite unpredictable and there would be many jump mispredictions. The best I could produce was this:


if(b == 0) return 0x80000000;

const int zero = 0;
const int one = 1;

__asm
{
mov eax, a
cdq
idiv b

mov ebx, 0
cmp a, 0
cmovl ebx, one
test edx, edx
cmovz ebx, zero
sub eax, ebx

mov a, eax
}

return a;

Ugly and slow? I know, I know... my integer code is very rusty. Most of the time I work with the very nice MMX and SSE instruction sets but today is different. That's why I could very much use your help. Any ideas would be very much appreciated as this is a bottleneck of my application! Thanks.
Posted on 2003-10-29 18:28:37 by C0D1F1ED
mov eax,a
cdq
or eax,eax
setl ecx
idiv b

or edx,edx
setnz edx
and ecx,edx
sub eax,ecx
Posted on 2003-10-30 08:44:42 by MikDay

mov eax,a
cdq
or eax,eax
setl ecx
idiv b

or edx,edx
setnz edx
and ecx,edx
sub eax,ecx

Thanks! I forgot about the SET instructions. Why do you use OR instead of TEST? TEST doesn't write back to the register so it seems faster to me...

Anyway, I think I've found an even faster way:


mov eax, a
cdq
idiv ebx

shr edx, 31
sub eax, edx

It is based on the observation that if (a < 0 && a % b) is true then the remainder has a sign bit. And since I have to subtract one from the quotient under this condition, I can just subtract the sign bit. So the C++ equivalent becomes:


inline int floorDiv(int a, int b)
{
if(b == 0) return 0x80000000;
return a / b - (a % b >> 31);
}

I think I''ll leave it this way because the compiler can better inline it.

I'd still like to get rid of the divide by zero check though. I know it has something to do with masking interrupts, but I don't know the details or if it's possible on ring 3. Anyone?
Posted on 2003-10-30 10:47:44 by C0D1F1ED
Hi, C0D1F1ED.
I don't think interrupt masking is possible under ring-3 (not "legally", anyway). You can still do SEH to catch exceptions (there are samples on this in Iczelion's site), but IMHO it's probably more trouble that it's worth if you're looking for a speed optimization.
Posted on 2003-10-30 16:37:55 by QvasiModo

Hi, C0D1F1ED.
I don't think interrupt masking is possible under ring-3 (not "legally", anyway). You can still do SEH to catch exceptions (there are samples on this in Iczelion's site), but IMHO it's probably more trouble that it's worth if you're looking for a speed optimization.

Thanks for the information! Then I'll just attempt to avoid division by zero at a higher level...
Posted on 2003-10-30 19:11:21 by C0D1F1ED
Ok, here are all four functions:


inline int floorDiv(int a, int b)
{
return a / b - ((unsigned)(a % b) >> 31);
}

inline int ceilDiv(int a, int b)
{
a += b - 1;
return a / b - ((unsigned)(a % b) >> 31);
}

inline int floorMod(int a, int b)
{
int r = a % b;
return r + ((r >> 31) & b);
}

inline int ceilMod(int a, int b)
{
int r = (a + b - 1) % b;
return r + ((r >> 31) & b) - b + 1;
}

I won't expect any further optimizations, but the last one still looks sub-optimal, doesn't it? ;)
Posted on 2003-10-30 19:26:48 by C0D1F1ED
I've never used SEH, but I can't see how it would effect optimization!? SEH is setup during program initialization and would only cost cycles when an exception occurs - hopefully your program is not dividing by zero all time. :) IMHO, this exception must be handled - one way or another.
Posted on 2003-10-30 19:41:11 by bitRAKE
Hi, C0D1F1ED
1. IMHO, the instruction TEST reg,reg and instruction OR reg,reg require only 1 CPU clock.
2. I agree with bitRAKE - normally SEH don't effect optimization.
Posted on 2003-10-31 08:27:12 by MikDay
Originally posted by MikDay
1. IMHO, the instruction TEST reg,reg and instruction OR reg,reg require only 1 CPU clock.

It's not an opinion, it's a fact or it isn't. ;) Both instructions are very similar, but OR writes the result back to the destination register. Therefore it creates a dependecy that would otherwise not be there:


test eax, eax
mov ebx, eax

These instructions could, theoretically, execute in parallel. They both only read eax.


or eax, eax
mov ebx, eax

These instrucions can't execute in parallel, because the second has to wait for the first one's result. The processor does not know that the first instruction does not change the value of eax. Actually it could know this by analyzing the instruction type and operands but I'm quite sure they don't waste transistors on this because it should be detected in the compiler's peephole optimizer.
Posted on 2003-10-31 09:49:20 by C0D1F1ED

I've never used SEH, but I can't see how it would effect optimization!? SEH is setup during program initialization and would only cost cycles when an exception occurs - hopefully your program is not dividing by zero all time. :) IMHO, this exception must be handled - one way or another.

I had in mind to set up SEH inside the function... could you believe it didn't occur to me that you could do it on startup instead? (Well, SEH is a rather theorical knowledge to me since I have never actually implemented it yet).
Sorry for that misleading tip. :o
Posted on 2003-11-02 15:19:07 by QvasiModo