I'm having the strangest problem ever...

I first noticed something strange when running my C++ code with debugging enabled. Performance was higher than with debugging disabled! After a long search, I narrowed it down to this code:


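inline float fsqrt(float x)
{
    __asm
    {
        rsqrtss xmm0, x     ; xmm0 = approximate 1/sqrt(x)
        mulss xmm0, x       ; xmm0 = x * 1/sqrt(x) = approximate sqrt(x)
        movss x, xmm0       ; store the result back into x
    }

    return x;
}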

An innocent function, or so I thought... I isolated it from my project and compared its performance with the standard fsqrt function. It is more than four times slower! When I changed it to the code below, it outperformed fsqrt as expected:


rsqrtss xmm0, x
rcpss xmm0, xmm0
movss x, xmm0

So what the HELL is going on? Am I really overlooking something or does the original code cause a major internal CPU stall? I'm working on a laptop with Pentium M processor, if that matters.

Obviously I would rather work with the first method, because it uses only one approximation instruction. BTW, Scali suggested this 'optimization' to me. :notsure:
Posted on 2004-03-01 08:43:20 by C0D1F1ED
There's a logical explanation! Although it's still devilish. :grin:

The square root of 0 is 0, as we all know. If we compute it via rsqrt+rcp we first get Inf, then 0. However, with rsqrt+mul we multiply Inf by zero, which results in NaN. That ain't the right answer, and what's worse, doing further calculations with NaN is extremely slow.

So, what appears to be the best method for fast square root computation, behaves very badly in the limit. :( Any idea if this can be avoided?
Posted on 2004-03-06 08:05:40 by C0D1F1ED
Does anyone know what the precision loss is from using rsqrt+rcp? Both are supposed to have 12-bit mantissa precision, but I don't know what the resulting precision would be.
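
One way to get a feel for it is simply to measure. Intel documents a maximum relative error of 1.5 * 2^-12 for each of the two instructions, so to first order the chained result should stay within about 3 * 2^-12 (roughly 7.3e-4). The sketch below samples a range and reports the worst relative error against `std::sqrt`. The function names and the sampling scheme are my own assumptions, not anything from the posts above:

```cpp
#include <xmmintrin.h>  // _mm_rsqrt_ss, _mm_rcp_ss
#include <cmath>        // std::sqrt, std::fabs

// Hypothetical helper: sqrt(x) approximated as rcp(rsqrt(x)).
inline float approx_sqrt_rcp(float x)
{
    __m128 v = _mm_set_ss(x);
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_rsqrt_ss(v)));
}

// Worst relative error of approx_sqrt_rcp over `steps` evenly spaced
// samples in [lo, hi]. Assumes lo > 0 and steps >= 2.
inline double max_relative_error(float lo, float hi, int steps)
{
    double worst = 0.0;
    for (int i = 0; i < steps; ++i)
    {
        float t = static_cast<float>(i) / static_cast<float>(steps - 1);
        float x = lo + (hi - lo) * t;
        double exact  = std::sqrt(static_cast<double>(x));
        double approx = approx_sqrt_rcp(x);
        double err    = std::fabs(approx - exact) / exact;
        if (err > worst) worst = err;
    }
    return worst;
}
```

For example, `max_relative_error(1.0f, 1000.0f, 100000)` gives the worst error seen on that range. The exact figure depends on the CPU, since the approximation tables are implementation-specific.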
Posted on 2004-03-08 16:23:48 by C0D1F1ED
I would suggest doing a precheck of the value you wish to square root:

inline float fsqrt(float x)
{
    __asm
    {
        mov eax, dword ptr x    ; raw bit pattern of x (a float is a single dword)
        test eax, 7FFFFFFFh     ; ignore the sign bit so -0.0f is caught too
        jz @End                 ; x is zero: return it unchanged

        rsqrtss xmm0, x
        mulss xmm0, x
        movss x, xmm0

    @End:
    }

    return x;
}


I'm sure you can do this somehow with MMX as well, but this should get the point across. 0.0f = 00000000h (and -0.0f = 80000000h).

Regards,
NaN
Posted on 2004-03-08 20:32:39 by NaN
What about adding a tiny number to x before doing the sqrt? This should avoid sqrt(0), and if the context is a 3D engine this minuscule loss of precision should be acceptable (at least more acceptable than a large performance hit). Or could you avoid taking sqrt(0) in the first place?
Posted on 2004-03-09 01:36:32 by f0dder