I'm trying to find the fastest way to normalize a 3D vector. In my application, I
don't have tight loops, and I must work on 3-float (non-aligned) data. I'm attaching
what I've come up with, using SSE. The fastest version I've got is only barely faster
than a plain old C function. Can you do better?

The numbers I get on a 3 GHz P4 are:

NormalizeC : 83.856598 clocks
Normalize_ASM_Acc : 82.968558 clocks
Normalize_ASM_InAcc : 69.917598 clocks

Note that the "InAcc" version is much less accurate than the other two!!

Compiled with MSVC7
Posted on 2003-02-18 18:41:09 by cbloom
mov eax, pIn
mov ecx, pOut

movups xmm1, [eax] // unaligned load: x, y, z plus the 4 bytes that follow
movaps xmm0, xmm1 // xmm0 = original
mulps xmm1, xmm1 // xmm1 = squared

movaps xmm2, xmm1 // @@ maybe not needed?
movaps xmm3, xmm1 // @@ maybe not needed?

shufps xmm2, xmm2, 1 // xmm2[0] = y*y
shufps xmm3, xmm3, 2 // xmm3[0] = z*z

addss xmm1, xmm2
addss xmm1, xmm3 // xmm1[0] = accumulated length squared

rsqrtss xmm1, xmm1 // xmm1[0] = approximate inv sqrt

shufps xmm1, xmm1, 0 // distribute to all four lanes

mulps xmm0, xmm1 // multiply

movaps [ecx], xmm0 // store 16 bytes to pOut (movaps needs 16-byte alignment)
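For readers who prefer intrinsics to inline asm, here is a minimal C sketch of the same sequence. It is not the code that was timed above. The function name `normalize3_sse` is made up for illustration, and two things differ from the listing: it builds the register from three scalar reads so it never touches the fourth float, and it adds one Newton-Raphson step after `rsqrtss`, a common way to recover most of the accuracy the raw estimate loses. Whether that extra step still beats the C version would have to be measured.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: normalize a packed, possibly unaligned 3-float vector.
   A zero-length input is not handled and produces NaNs. */
static void normalize3_sse(const float *in, float *out)
{
    /* Build (x, y, z, 0) without reading past the third float. */
    __m128 v  = _mm_set_ps(0.0f, in[2], in[1], in[0]);
    __m128 sq = _mm_mul_ps(v, v);

    /* Bring y*y and z*z down to lane 0 and add, like shufps/addss above. */
    __m128 y2   = _mm_shuffle_ps(sq, sq, 1);
    __m128 z2   = _mm_shuffle_ps(sq, sq, 2);
    __m128 len2 = _mm_add_ss(_mm_add_ss(sq, y2), z2);

    /* Fast, low-precision estimate of 1/sqrt(len2). */
    __m128 r = _mm_rsqrt_ss(len2);

    /* One Newton-Raphson step: r = r * (1.5 - 0.5 * len2 * r * r). */
    __m128 half = _mm_set_ss(0.5f);
    __m128 three_halves = _mm_set_ss(1.5f);
    __m128 t = _mm_mul_ss(_mm_mul_ss(half, len2), _mm_mul_ss(r, r));
    r = _mm_mul_ss(r, _mm_sub_ss(three_halves, t));

    /* Broadcast lane 0 and scale the original vector. */
    r = _mm_shuffle_ps(r, r, 0);
    v = _mm_mul_ps(v, r);

    /* Write back only three floats so nothing after the vector is clobbered. */
    float tmp[4];
    _mm_storeu_ps(tmp, v);
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
}
```

Calling `normalize3_sse(in, out)` with `in = {3, 0, 4}` should give roughly `{0.6, 0, 0.8}`. Dropping the Newton-Raphson lines gives the equivalent of the raw `rsqrtss` path.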
I'd use a vec4 with vec[3]=0 just so I could use a single move. Given the forward dependencies and latency, two vectors could be normalized in little more time than one. Sorry, not much help.
Posted on 2003-02-19 01:23:46 by bitRAKE
Inner product using SSE requires checking the input values, because squaring a large single-precision value can overflow. OTOH, the FPU does not need it because FPU registers can hold DBL_MAX * DBL_MAX without a problem.

I don't know whether 3D modeling vectors are bounded by some small number. But a general way to compute a complex absolute value usually involves checking the input values so that the inner product computation does not overflow.
Posted on 2003-02-19 02:00:30 by Starless
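One common form of the input check described above is to divide by the largest component magnitude before squaring, so every square is at most 1. The sketch below is an illustration only. The name `normalize3_scaled` and its return convention are assumptions, not something from this thread, and the extra divides make it slower than the straight version.

```c
#include <float.h>      /* FLT_MAX */
#include <xmmintrin.h>  /* only for _mm_sqrt_ss; the rest is scalar C */

static float abs_f(float a) { return a < 0.0f ? -a : a; }
static float max_f(float a, float b) { return a > b ? a : b; }

/* Hypothetical helper: returns 1 on success, 0 if the largest component
   is zero or infinite and the vector cannot be normalized. */
static int normalize3_scaled(const float *in, float *out)
{
    float m = max_f(abs_f(in[0]), max_f(abs_f(in[1]), abs_f(in[2])));
    if (!(m > 0.0f) || m > FLT_MAX)
        return 0;

    /* After scaling, each component lies in [-1, 1]. */
    float x = in[0] / m;
    float y = in[1] / m;
    float z = in[2] / m;

    /* The sum of squares is now between 1 and 3, so it cannot overflow. */
    float s   = x * x + y * y + z * z;
    float len = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(s)));

    out[0] = x / len;
    out[1] = y / len;
    out[2] = z / len;
    return 1;
}
```

With an input such as `{3e30f, 0, 4e30f}`, squaring the components directly overflows a float, while the scaled version still returns roughly `{0.6, 0, 0.8}`.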

Hi Charles,
Welcome on board. :alright:
Posted on 2003-02-19 03:41:05 by Maverick
Greets all!

Unfortunately, I can't use a Vec4, because I store these things in tightly packed structs. I've been considering having both and letting them auto-convert: you store a Vec3, but as soon as you start doing math on it you get an aligned Vec4, which you can then store back to a Vec3.

I also can't work on more than one at a time, since my Normalize is called all over, in random and complicated places.

I guess I'm rather disappointed that I can't get any benefit from all this, and that a float sqrt and divide aren't really hurting that much.
Posted on 2003-02-19 11:25:53 by cbloom
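The Vec3/Vec4 split described above could look roughly like the sketch below. This is only an illustration: the `Vec3` and `Vec4` layouts and the function names are assumptions, and since C has no implicit conversions, explicit load/store functions stand in for the auto-convert step.

```c
#include <xmmintrin.h>

/* Storage type: three packed floats, 12 bytes, no alignment requirement. */
typedef struct { float x, y, z; } Vec3;

/* Working type: lives in an SSE register, fourth lane kept at 0. */
typedef struct { __m128 v; } Vec4;

/* Widen a stored Vec3 without reading past its third float. */
static Vec4 vec4_from_vec3(const Vec3 *p)
{
    Vec4 r;
    r.v = _mm_set_ps(0.0f, p->z, p->y, p->x);
    return r;
}

/* Narrow back to storage, writing only three floats. */
static void vec4_store_vec3(Vec3 *p, Vec4 a)
{
    float t[4];
    _mm_storeu_ps(t, a.v);
    p->x = t[0];
    p->y = t[1];
    p->z = t[2];
}

/* Example of math done on the working type. */
static Vec4 vec4_scale(Vec4 a, float s)
{
    Vec4 r;
    r.v = _mm_mul_ps(a.v, _mm_set1_ps(s));
    return r;
}

/* Dot product of the first three lanes (the fourth lane is 0 by construction). */
static float vec4_dot3(Vec4 a, Vec4 b)
{
    float t[4];
    _mm_storeu_ps(t, _mm_mul_ps(a.v, b.v));
    return t[0] + t[1] + t[2];
}
```

The idea is that a chain of operations pays the widen and narrow cost once at each end rather than per operation. For a single isolated Normalize call, the conversion overhead may well cancel any gain.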
Dang MSVC7 does a good job.
It seems like you guys are doing a lot of shuffling around
(I'm still using MSVC6 :( )
How did you get MSVC7 to use SSE? Can MSVC6 use SSE/3DNow!?
Posted on 2003-02-19 20:15:22 by x86asm