I'm trying to find the fastest way to normalize a 3D vector. In my application, I don't have tight loops, and I must work on 3-float (non-aligned) data. I'm attaching what I've come up with, using SSE. The fastest version I've got is only barely faster than a plain old C function. Can you do better?

The numbers I get on a 3 GHz P4 are:

NormalizeC : 83.856598 clocks

Normalize_ASM_Acc : 82.968558 clocks

Normalize_ASM_InAcc : 69.917598 clocks

Note that the "InAcc" version is much less accurate than the other two!!

Compiled with MSVC7

```
mov eax, pIn
mov ecx, pOut
movups xmm1, [eax]     // xmm1 = original (unaligned load, reads 4 floats)
movaps xmm0, xmm1      // xmm0 = copy of original
mulps xmm1, xmm1       // xmm1 = squared components
movaps xmm2, xmm1      // @@ maybe not needed?
movaps xmm3, xmm1      // @@ maybe not needed?
shufps xmm2, xmm2, 1   // xmm2[0] = y*y
shufps xmm3, xmm3, 2   // xmm3[0] = z*z
addss xmm1, xmm2
addss xmm1, xmm3       // xmm1[0] = accumulated length squared
rsqrtss xmm1, xmm1     // xmm1[0] = approximate inverse sqrt
shufps xmm1, xmm1, 0   // distribute to all four lanes
mulps xmm0, xmm1       // multiply
movaps [ecx], xmm0     // aligned store: pOut must be 16-byte aligned, with room for 4 floats
```
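For readers who prefer intrinsics to inline assembly, here is a minimal sketch of the same idea in C. It is not the poster's code: the name `normalize3_sse` and the `refine` flag are made up for illustration, and it swaps the packed `movups`/`movaps` for scalar loads and stores so that it touches exactly three floats. With `refine` set, one Newton-Raphson step is applied to the `rsqrtss` estimate, which might narrow the accuracy gap noted for the "InAcc" version. Whether it beats a plain sqrt and divide would have to be measured.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical name and signature, for illustration only.
   Normalizes in[0..2] into out[0..2]. A zero-length input yields NaNs,
   because rsqrt(0) is infinity and 0 * infinity is NaN. */
void normalize3_sse(const float *in, float *out, int refine)
{
    __m128 x, y, z, len2, inv;

    assert(in && out);

    /* Scalar loads: exactly three floats are read, no alignment required. */
    x = _mm_load_ss(in + 0);
    y = _mm_load_ss(in + 1);
    z = _mm_load_ss(in + 2);

    /* Lane 0 holds x*x + y*y + z*z. */
    len2 = _mm_add_ss(_mm_add_ss(_mm_mul_ss(x, x), _mm_mul_ss(y, y)),
                      _mm_mul_ss(z, z));

    /* Fast approximation of 1/sqrt(len2), roughly 12 bits of precision. */
    inv = _mm_rsqrt_ss(len2);

    if (refine) {
        /* One Newton-Raphson step: inv = inv * (1.5 - 0.5 * len2 * inv * inv) */
        __m128 half = _mm_set_ss(0.5f);
        __m128 three_halves = _mm_set_ss(1.5f);
        __m128 t = _mm_mul_ss(_mm_mul_ss(half, len2), _mm_mul_ss(inv, inv));
        inv = _mm_mul_ss(inv, _mm_sub_ss(three_halves, t));
    }

    /* Scalar stores: exactly three floats are written. */
    _mm_store_ss(out + 0, _mm_mul_ss(x, inv));
    _mm_store_ss(out + 1, _mm_mul_ss(y, inv));
    _mm_store_ss(out + 2, _mm_mul_ss(z, inv));
}
```

Since all three inputs are loaded before anything is stored, `in` and `out` may point at the same vector.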

I'd use a vec4 with vec[3]=0 just so I could use a single move. Given the forward dependencies and latency, two could be done for a little more than one. Sorry, not much help.

Inner product using SSE requires checking the input values. OTOH, the FPU does not need it, because FPU registers can hold DBL_MAX * DBL_MAX without a problem.

I don't know if 3D modeling vectors are bounded by some small number. But, a general way to compute the complex absolute value usually involves input value checking in order not to overflow during inner product computation.
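One common form of that input check is to divide by the largest component magnitude before squaring, so the sum of squares stays between 1 and 3 and cannot overflow. The sketch below is an illustration under that assumption, not something posted in the thread; `normalize3_scaled` and `abs_f` are hypothetical names, and finite inputs are assumed.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_set_ss, _mm_sqrt_ss, _mm_cvtss_f32 */

/* Hypothetical helper: absolute value of a float. */
static float abs_f(float v) { return v < 0.0f ? -v : v; }

/* Hypothetical name. Normalizes in[0..2] into out[0..2], assuming finite inputs.
   Returns 0 and writes zeros for a zero vector, 1 otherwise. */
int normalize3_scaled(const float *in, float *out)
{
    float m, x, y, z, len;

    assert(in && out);

    /* Largest component magnitude. */
    m = abs_f(in[0]);
    if (abs_f(in[1]) > m) m = abs_f(in[1]);
    if (abs_f(in[2]) > m) m = abs_f(in[2]);

    if (m == 0.0f) {
        out[0] = out[1] = out[2] = 0.0f;
        return 0;
    }

    /* After scaling, every component lies in [-1, 1], so the squares
       cannot overflow even when the inputs are close to FLT_MAX. */
    x = in[0] / m;
    y = in[1] / m;
    z = in[2] / m;

    /* Single-precision square root of the scaled length squared. */
    len = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x * x + y * y + z * z)));

    out[0] = x / len;
    out[1] = y / len;
    out[2] = z / len;
    return 1;
}
```

The extra compares and divides are the price of the check, which is probably why it tends to be skipped when the vectors are known to be small.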

Hi Charles,

Welcome on board. :alright:

Greets all!

Unfortunately, I can't use a Vec4, because I store these things in tightly packed structs. I've been considering having both and letting them auto-convert: you store a Vec3, but as soon as you start doing math on it, you get an aligned Vec4, which you can then store back to a Vec3.

I also can't work on more than one at a time, since my Normalize is called all over, in random and complicated places.

I guess I'm rather disappointed that I can't get any benefit from all this, and that a float sqrt and divide aren't really hurting that much.
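The Vec3/Vec4 split described in this post could be sketched roughly as follows. This is only an illustration: the type and function names (`Vec3`, `Vec4`, `vec4_load`, `vec4_store`, `vec4_normalize`) are assumptions, and plain C has no implicit conversions, so the "auto-convert" step is written as explicit load and store helpers (in C++ these would more likely be constructors or conversion operators).

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Storage type: 12 bytes, safe to pack tightly inside other structs. */
typedef struct { float x, y, z; } Vec3;

/* Working type: lives in an SSE register (or a 16-byte aligned slot). */
typedef __m128 Vec4;

/* Widen a packed Vec3 into a Vec4 with lanes (x, y, z, 0). */
Vec4 vec4_load(const Vec3 *p)
{
    assert(p);
    return _mm_set_ps(0.0f, p->z, p->y, p->x);
}

/* Narrow a Vec4 back to a packed Vec3, dropping the fourth lane. */
void vec4_store(Vec3 *p, Vec4 a)
{
    float t[4];
    assert(p);
    _mm_storeu_ps(t, a);  /* unaligned store into a local temporary */
    p->x = t[0];
    p->y = t[1];
    p->z = t[2];
}

/* Full-precision normalize using packed sqrt and divide.
   A zero vector produces NaNs (0 / 0). */
Vec4 vec4_normalize(Vec4 a)
{
    __m128 sq = _mm_mul_ps(a, a);  /* (x*x, y*y, z*z, 0) */
    /* Swap neighbouring lanes and add: (xx+yy, xx+yy, zz, zz) */
    __m128 s = _mm_add_ps(sq, _mm_shuffle_ps(sq, sq, _MM_SHUFFLE(2, 3, 0, 1)));
    /* Swap the two halves and add: every lane = xx + yy + zz */
    s = _mm_add_ps(s, _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_div_ps(a, _mm_sqrt_ps(s));
}
```

A call site would then read something like `vec4_store(&v, vec4_normalize(vec4_load(&v)));`, and longer chains of math could stay in `Vec4` between the load and the store.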

Dang MSVC7 does a good job.

It seems like you guys are doing a lot of shuffling around.

(I'm still using MSVC6 :( )

How did you get MSVC7 to use SSE? Can MSVC6 use SSE/3DNow!?