I'm trying to find the fastest way to normalize a 3D vector. In my application, I
don't have tight loops, and I must work on 3-float (non-aligned) data. I'm attaching
what I've come up with, using SSE. The fastest version I've got is only barely faster
than a plain old C function. Can you do better?

The numbers I get on a 3 GHz P4 are:

NormalizeC : 83.856598 clocks
Normalize_ASM_Acc : 82.968558 clocks
Normalize_ASM_InAcc : 69.917598 clocks

Note that the "InAcc" version is much less accurate than the other two!!

Compiled with MSVC7
Posted on 2003-02-18 18:41:09 by cbloom
```
	mov	eax, pIn
	mov	ecx, pOut

	movups	xmm1, [eax]	// xmm1 = original (unaligned load, reads 16 bytes)
	movaps	xmm0, xmm1	// xmm0 = copy of original
	mulps	xmm1, xmm1	// xmm1 = squared

	movaps	xmm2, xmm1	// @@ maybe not needed?
	movaps	xmm3, xmm1	// @@ maybe not needed?

	shufps	xmm2, xmm2, 1	// xmm2[0] = y*y
	shufps	xmm3, xmm3, 2	// xmm3[0] = z*z

	addss	xmm1, xmm2	// xmm1[0] = x*x + y*y
	addss	xmm1, xmm3	// xmm1[0] = accumulated length square

	rsqrtss	xmm1, xmm1	// xmm1[0] = approximate inv sqrt

	shufps	xmm1, xmm1, 0	// distribute

	mulps	xmm0, xmm1	// multiply

	movaps	[ecx], xmm0	// store 16 bytes to pOut (movaps needs 16-byte alignment)
```
I'd use a vec4 with vec[3]=0 just so I could use a single move. Given the forward dependencies and latency, two vectors could be normalized in a little more time than one. Sorry, not much help.
Posted on 2003-02-19 01:23:46 by bitRAKE
An inner product using SSE requires checking the input values, because squaring a large float can overflow. OTOH, the FPU does not need it because FPU registers can hold DBL_MAX * DBL_MAX without a problem.

I don't know whether 3D modeling vectors are bounded by some small number. But a general way to compute a complex absolute value usually involves checking the input values so that the inner product computation does not overflow.
Posted on 2003-02-19 02:00:30 by Starless

Hi Charles,
Welcome on board. :alright:
Posted on 2003-02-19 03:41:05 by Maverick
Greets all!

Unfortunately, I can't use a Vec4, because I store these things in tightly packed structs. I've been considering having both and letting them auto-convert: you store a Vec3, but as soon as you start doing math on it you get an aligned Vec4, which you can then store back to a Vec3.

I also can't work on more than one at a time, since my Normalize is called all over, in random and complicated places.

I guess I'm rather disappointed that I can't get any benefit from all this, and that a float sqrt and divide aren't really hurting that much.
Posted on 2003-02-19 11:25:53 by cbloom
Dang MSVC7 does a good job.
It seems like you guys are doing a lot of shuffling around
(I'm still using MSVC6 :( )
How did you get MSVC7 to use SSE? Can MSVC6 use SSE/3DNow!?
Posted on 2003-02-19 20:15:22 by x86asm