I'm having the strangest problem ever...


I first noticed something strange when running my C++ code with debugging enabled. Performance was higher than with debugging disabled! After a long search, I narrowed it down to this code:

```
inline float fsqrt(float x)
{
    __asm
    {
        rsqrtss xmm0, x
        mulss xmm0, x
        movss x, xmm0
    }
    return x;
}
```

An innocent function, or so I thought... I isolated it from my project and compared its performance with the standard fsqrt function. It is more than four times slower! Changing it to the code below, it outperformed fsqrt as expected:

```
rsqrtss xmm0, x
rcpss xmm0, xmm0
movss x, xmm0
```

So what the HELL is going on? Am I really overlooking something or does the original code cause a major internal CPU stall? I'm working on a laptop with Pentium M processor, if that matters.

Obviously I would rather work with the first method, because it uses only one approximation instruction. BTW, Scali suggested this 'optimization' to me. :notsure:

There's a logical explanation! Although it's still devilish. :grin:

The square root of 0 is 0, as we all know. If we compute it via rsqrt+rcp we first get Inf, then 0. However, with rsqrt+mul we multiply Inf by zero, which results in NaN. That ain't the right answer, and what's worse, doing further calculations with NaN is extremely slow.

So, what appears to be the best method for fast square root computation, behaves very badly in the limit. :( Any idea if this can be avoided?


Does anyone know what the precision loss is from using rsqrt+rcp? Both are supposed to have 12-bit mantissa precision but I don't know what the resulting precision would be like.

I would suggest doing a precheck of the value you wish to square root:


```
inline float fsqrt(float x)
{
    __asm
    {
        lea edx, x
        mov eax, [edx]       ; load the 32 bits of x
        or eax, eax          ; x is a 4-byte float: +0.0f is all bits clear
        jz End               ; sqrt(0) = 0, so skip the approximation
        rsqrtss xmm0, x
        mulss xmm0, x
        movss x, xmm0
    End:
    }
    return x;
}
```

I'm sure you could do this somehow with MMX as well, but this should get the point across. 0.0f = 00000000h

Regards,

NaN

What about adding a tiny number to X before doing the sqrt? This would avoid sqrt(0), and if the context is a 3D engine this minuscule loss of precision should be acceptable (at least more acceptable than a large performance hit). Or you could try to avoid ending up taking sqrt(0) at all?