I have n 3d-vectors (in 32-bit floating point), v.
All v have length <= 1, and n > 50.
What is the fastest (and numerically stable) way to sum all 1/(v.v)?
(v.v == dotproduct of v with itself)

SSE/3dnow! can do a fast 1/x, can't it? wonder if that's useful.. (vanilla routine is more important though). Ideas more important than source code - keeping numerically stable is an especially important point.

With about 50 vectors, the following (pseudo)code wasn't numerically stable.


dividend = 1.0f, divisor = 1.0f;

for each v[n] {
dist = v[n].v[n]
dividend *= dist;
dividend += divisor;
divisor *= dist;
}

result = dividend/divisor;
Posted on 2003-03-31 12:42:16 by f0dder
For numerical stability, you should use FPU and never consider SSE as a way to go. That is because using SSE may result in overflow/underflow even if the final result is representable in single precision. This cannot happen on FPU with single precision numbers (even with double precision numbers!) as long as you set PC to the default value. I mean, the default value by Intel. And, most OSes set PC to double precision, and this is perfectly fine with single precision numbers.


You mean, result = (dividend - divisor)/divisor = (dividend/divisor) - 1, right? Then, numerical stability of each formula will depend on the magnitude of each component. Otherwise, FPU should give you exactly what you want for single precision. If you were not using FPU register as a temporary storage, do so. Popping out FPU register to memory for temporary data storage defeats the advantage of FPU over SSE.
Posted on 2003-03-31 14:52:52 by Starless
You do realize that if any one of the v(n) values is zero, the result of the reciprocal will be "infinity" and the eventual sum will also be infinity. If a value of zero is not possible, I would suggest the following code if all the v(n) values are in a continuous array.
(If a value of zero is possible, I would have to verify what would happen when trying to use the generated "infinity" value with the other FPU instructions before recommending that code.)

finit ;clear the FPU
fldz ;zero the summation
mov ecx,number_of_vectors ;use for count
lea eax,vector_array ;use for array pointer
@@:
fld1 ;load 1 on FPU
fdiv dword ptr ;get the reciprocal
add eax,4 ;adjust for next vector
fmul st,st(0) ;get the square of the reciprocal
;(much faster than another div by a memory value)
dec ecx
fadd ;add new result to summation
jnz @B ;FPU instructions don't affect CPU flags

;your sum result is now in st(0)

Raymond
Posted on 2003-03-31 21:22:06 by Raymond
Raymond, your code does not look faster than a direct translation of f0dder's algorithm into FPU instructions. Frankly, your code was exactly what I thought when I first replied to his posting.

Then, later, I realized that his algorithm was much faster because the main loop would contain fmul and fadd only. There is no way to overflow/undeflow an FPU register by his algorithm (with his intended single precision numbers) under 32-bit OS with 32-bit addres-space full of memory. So the numerical stability is secured, unless some pathological series of |x| values come in (including what you pointed out, |x| = 0).
Posted on 2003-04-01 00:17:20 by Starless
I checked out the code I suggested with one or more embedded 0's in an array. The code still works fine.

Just make sure to check the resulting sum for a value of INFINITY before you use it elsewhere.

Raymond
Posted on 2003-04-01 09:41:35 by Raymond
With f0dder's algo you can just check 'divisor'. If it's zero, the result is infinite and the division is not needed.

Do you know that this sounds like metaballs?
Posted on 2003-04-04 01:32:17 by gliptic
it is, gliptic. And the method I presented did give numerical instability - not with intels compiler, though. Guess one ought to have a look at the instructions produced.
Posted on 2003-04-04 02:01:10 by f0dder
F0dder used:

(1/a)+(b/c)=(c+a*b)/(a*c)

The problem is that your divisor grows faster than your dividend.

The only obvious way to solve the overflow problem is to normalize your v, predivide all your v by the max value of the v[] array.

JC
Posted on 2003-04-05 06:20:37 by MCoder
Could the vectors be quantized to, for example, 32-bit triplets? That would probably require 64-bit number processing though.
Posted on 2003-04-05 14:39:56 by gliptic
Another (dumb) idea.

You need to compute something like:

1/A + 1/B + 1/C + 1/D

This gives:
(A*B*C + A*B*D + A*C*D + B*C*D) / (A*B*C*D)

11 multiplications and 1 division instead of 4 divisions, and I think there is no overflow in this case.
We may reduce the number of divisions even further.

Suppose you have to add between 50 and 100 inverses, you can perhaps have routines for every case ?
If you don't know the maximum number of inverse to add, I guess this method is useless.
Posted on 2003-04-05 16:28:00 by MCoder
No, it's not equivalent to the first code.

The first code is based on:

(1/a)+(b/c)=(c+a*b)/(a*c)

and dividend=c+a*b, divisor=a*c
The error cumulates, since c is bigger than a*b, thus you loose precision on the dividend.

In this case:
(A*B*C + A*B*D + A*C*D + B*C*D) / (A*B*C*D)

you don't loose precision.

BTW, another suggestion (which is useful when we are trying to multiply or divide):
why not using log ?
log(1/a)=-log(a)
If you can store the logs in a second table, you'll simply have to add them.
Posted on 2003-04-05 16:40:56 by MCoder
Ok, I found a way to make the computation stable.

Suppose f0dder's code is stable for 4 terms (this can be checked easily).
You can use the usual trick of subdivising your problem in smaller problems:



i=0;
result=0;
while(n>=4)
{
dividend = 1.0f, divisor = 1.0f;
repeat 4 times:
dist = v[i].v[i];
dividend *= dist;
dividend += divisor;
divisor *= dist;
++i;
end repeat
result+= dividend/divisor;
n-=4;
}
while(n)
{
dist=v[i]*v[i];
result += 1/dist;
++i;
--n;
}


This gives 8 multiplications and 1 division for 4 values.

If you want better numerical stability, you can use 3 terms with:
1/a + 1/b + 1/c = (a*b+a*c+b*c)/(a*b*c)
This can be done in 4 multiplications and 1 division.

With 4 terms:
1/a+1/b+1/c+1/d=(a*b*c+a*b*d+a*c*d+b*c*d)/(a*b*c*d)
I'm curious to know: in how much multiplications can you compute that ?
Posted on 2003-04-06 04:35:11 by MCoder
Ok, you can save one multiplication with:

1/a+1/b+1/c+1/d=(a*b*c+a*b*d+a*c*d+b*c*d)/(a*b*c*d)

if you pose:
K=a*b
L=c*d

(a*b*c+a*b*d+a*c*d+b*c*d)/(a*b*c*d)
=
(K*c+K*d+a*L+b*L)/(K*L)

=
(K*(c+d) + L*(a+b)) / (K*L)

5 multiplications, 2 additions and 1 division !
Posted on 2003-04-06 05:33:04 by MCoder
Here's MCoders suggestion knocked up in Perl (it ain't a million miles from C code). I generalised the method suggested in the last few posts. 2 N Muls, N Adds and 1 Division :)
$size = scalar @numbers;

$num1 = @numbers[$idx-1] + @numbers[$idx-2];
$num2 = @numbers[$idx-1] * @numbers[$idx-2];

for ($idx = $size - 3; $idx >= 0; $idx = $idx - 1) {

$num1 = $num1 * @numbers[$idx] + $num2;
$num2 = $num2 * @numbers[$idx];
}

$ans = $num1/$num2;
Posted on 2003-04-06 07:51:57 by Eóin
Sorry, wasn't paying attention :rolleyes:
Posted on 2003-04-06 08:04:20 by Eóin