I have n 3D vectors (in 32-bit floating point), v.

All v have length <= 1, and n > 50.

What is the fastest (**and numerically stable**) way to sum all 1/(v.v)? (v.v == dot product of v with itself)

SSE/3DNow! can do a fast 1/x, can't it? I wonder if that's useful... (a vanilla routine is more important, though). Ideas are more important than source code; keeping it numerically stable is an especially important point.

With about 50 vectors, the following (pseudo)code wasn't numerically stable.

```
dividend = 1.0f, divisor = 1.0f;
for each v[n] {
    dist = v[n].v[n];
    dividend *= dist;
    dividend += divisor;
    divisor *= dist;
}
result = dividend/divisor;
```

For numerical stability, you should use the FPU and never consider SSE as the way to go. That is because using SSE may result in overflow/underflow even when the final result is representable in single precision. This cannot happen on the FPU with single precision numbers (even with double precision numbers!) as long as you leave PC at the default value, I mean, the default value per Intel. Most OSes set PC to double precision, and this is perfectly fine with single precision numbers.

You mean, result = (dividend - divisor)/divisor = (dividend/divisor) - 1, right? Then, numerical stability of each formula will depend on the magnitude of each component. Otherwise, FPU should give you exactly what you want for single precision. If you were not using FPU register as a temporary storage, do so. Popping out FPU register to memory for temporary data storage defeats the advantage of FPU over SSE.


You do realize that if any one of the v(n) values is zero, the result of the reciprocal will be "infinity" and the eventual sum will also be infinity. If a value of zero is not possible, I would suggest the following code, assuming all the v(n) values are in a contiguous array.

(If a value of zero is possible, I would have to verify what would happen when trying to use the generated "infinity" value with the other FPU instructions before recommending that code.)

```
        finit                       ;clear the FPU
        fldz                        ;zero the summation
        mov     ecx,number_of_vectors ;use for count
        lea     eax,vector_array    ;use for array pointer
@@:
        fld1                        ;load 1 on FPU
        fdiv    dword ptr [eax]     ;get the reciprocal
        add     eax,4               ;adjust for next vector
        fmul    st,st(0)            ;get the square of the reciprocal
                                    ;(much faster than another div by a memory value)
        dec     ecx
        fadd                        ;add new result to summation
        jnz     @B                  ;FPU instructions don't affect CPU flags
        ;your sum result is now in st(0)
```

Raymond


Raymond, your code does not look faster than a direct translation of f0dder's algorithm into FPU instructions. Frankly, your code was exactly what I thought of when I first replied to his posting.

Then, later, I realized that his algorithm was much faster because the main loop would contain **fmul** and **fadd** only. There is no way to overflow/underflow an FPU register with his algorithm (with his intended single precision numbers) under a 32-bit OS with a 32-bit address space full of memory. So the numerical stability is secured, unless some pathological series of |x| values comes in (including what you pointed out, |x| = 0).

I checked out the code I suggested with one or more embedded 0's in an array. The code still works fine.

Just make sure to check the resulting sum for a value of INFINITY before you use it elsewhere.

Raymond


With f0dder's algo you can just check 'divisor'. If it's zero, the result is infinite and the division is not needed.

Do you know that this sounds like metaballs?


It is, gliptic. And the method I presented did give numerical instability, though not with Intel's compiler. I guess one ought to have a look at the instructions produced.

F0dder used:

(1/a)+(b/c)=(c+a*b)/(a*c)

The problem is that your divisor grows faster than your dividend.

The only obvious way to solve the overflow problem is to normalize your v, predivide all your v by the max value of the v[] array.

JC


Could the vectors be quantized to, for example, 32-bit triplets? That would probably require 64-bit number processing though.

Another (dumb) idea.

You need to compute something like:

1/A + 1/B + 1/C + 1/D

This gives:

(A*B*C + A*B*D + A*C*D + B*C*D) / (A*B*C*D)

11 multiplications and 1 division instead of 4 divisions, and I think there is no overflow in this case.

We may reduce the number of divisions even further.

Suppose you have to add between 50 and 100 inverses; you could perhaps have a routine for each case?

If you don't know the maximum number of inverses to add, I guess this method is useless.


No, it's not equivalent to the first code.

The first code is based on:

(1/a)+(b/c)=(c+a*b)/(a*c)

and dividend=c+a*b, divisor=a*c

The error accumulates: since c is bigger than a*b, you lose precision on the dividend.

In this case:

(A*B*C + A*B*D + A*C*D + B*C*D) / (A*B*C*D)

you don't lose precision.

BTW, another suggestion (which is useful when we are trying to multiply or divide):

why not use logs?

log(1/a)=-log(a)

If you can store the logs in a second table, you'll simply have to add them.


Ok, I found a way to make the computation stable.

Suppose f0dder's code is stable for 4 terms (this can be checked easily).

You can use the usual trick of subdividing your problem into smaller problems:


```
i = 0;
result = 0;
while (n >= 4) {
    dividend = 1.0f, divisor = 1.0f;
    repeat 4 times:
        dist = v[i].v[i];
        dividend *= dist;
        dividend += divisor;
        divisor *= dist;
        ++i;
    end repeat
    result += dividend/divisor;
    n -= 4;
}
while (n) {
    dist = v[i].v[i];
    result += 1/dist;
    ++i;
    --n;
}
```

This gives 8 multiplications and 1 division for 4 values.

If you want better numerical stability, you can use 3 terms with:

1/a + 1/b + 1/c = (a*b+a*c+b*c)/(a*b*c)

This can be done in 4 multiplications and 1 division.

With 4 terms:

1/a+1/b+1/c+1/d=(a*b*c+a*b*d+a*c*d+b*c*d)/(a*b*c*d)

I'm curious to know: in how many multiplications can you compute that?

Ok, you can save one multiplication with:

1/a+1/b+1/c+1/d=(a*b*c+a*b*d+a*c*d+b*c*d)/(a*b*c*d)

if you set:

K=a*b

L=c*d

(a*b*c+a*b*d+a*c*d+b*c*d)/(a*b*c*d)

=

(K*c+K*d+a*L+b*L)/(K*L)

=

(K*(c+d) + L*(a+b)) / (K*L)

5 multiplications, 3 additions and 1 division!


Here's MCoder's suggestion knocked up in Perl (it ain't a million miles from C code). I generalised the method suggested in the last few posts: 2N muls, N adds and 1 division :)

```
$size = scalar @numbers;

$num1 = $numbers[$size-1] + $numbers[$size-2];
$num2 = $numbers[$size-1] * $numbers[$size-2];

for ($idx = $size - 3; $idx >= 0; $idx = $idx - 1) {
    $num1 = $num1 * $numbers[$idx] + $num2;
    $num2 = $num2 * $numbers[$idx];
}

$ans = $num1/$num2;
```

Sorry, wasn't paying attention :rolleyes: