Hi guys,
I was converting the following C function to some Asm SSE code. You might be interested in it, I know I'd be interested in any suggestions to improve it. Heres the C function,
double L2_norm(vec,dim)

double *vec;
short dim;
int i;
double t_sum=0.0;

for (i=0; i<dim; i++)
t_sum += vec[i]*vec[i];


First I wrote a literal translation
proc L2_NormSlow,pV,sV

mov edx,[pV]
mov ecx,[sV]
add edx,[sV]
neg ecx

xorps xmm0,xmm0
.sseLp: movss xmm1,[edx+ecx]
mulss xmm1,xmm1
addss xmm0,xmm1
add ecx,4
js .sseLp

sqrtss xmm0,xmm0

Ugly I know but I really just wanted something that worked so I could test my answers. Then I worte this much longer version,
proc VectorLoop,pV,sV

mov ecx,[sV]
mov edx,[pV]

add edx,ecx
neg ecx

[COLOR=green]; Initial processing[/COLOR]

add ecx,32
jg .nops

[COLOR=green]; Pre 8 value loop processing[/COLOR]

.psLp: [COLOR=green]; 8 value loop[/COLOR]

[COLOR=green]; 1st 4 -> [edx+ecx-32][/COLOR]
[COLOR=green]; 2nd 4 -> [edx+ecx-16][/COLOR]

add ecx,32
jng .psLp

[COLOR=green]; Post 8 value loop processing[/COLOR]

.nops: sub ecx,32
jz .exit

add ecx,8
jg .noss

[COLOR=green]; Pre 2 value loop processing[/COLOR]

.ssLp: [COLOR=green]; 2 value loop[/COLOR]

[COLOR=green]; 1st -> [edx+ecx-8][/COLOR]
[COLOR=green]; 2nd -> [edx+ecx-4][/COLOR]

add ecx,8
jng .ssLp

[COLOR=green]; Post 2 value loop processing[/COLOR]

.noss: sub ecx,8
jz .exit

[COLOR=green]; Final value processing[/COLOR]

[COLOR=green]; value -> [edx-4][/COLOR]

.exit: [COLOR=green]; Final processing[/COLOR]

For large (approx 128mb) vectors the longer version runs in about 75% the time of the shorter. For small (approx 1kb) vectors the longer version runs in < 25% the time of the shorter. Any opinions?
Posted on 2003-07-04 12:25:07 by Eóin
E?in, might not gain much more speed, but I'd unroll the loop until all instruction latencies are hidden within other instructions.

mulps xmm4,xmm4
mulps xmm5,xmm5

addps xmm0,xmm2
addps xmm1,xmm3

movaps xmm2,[edx+ecx-16*4]
movaps xmm3,[edx+ecx-16*3]

addps xmm0,xmm4
addps xmm1,xmm5

movaps xmm4,[edx+ecx-16*2]
movaps xmm5,[edx+ecx-16*1]

mulps xmm2,xmm2
mulps xmm3,xmm3

add ecx,16*4
jng .psLp
...I'm just guessing from memory. Sorry, hope you get the idea? Look at the latencies for each instruction - this will be a greater boost for cached data. Check to see if it's already memory bound on 128MB's - if so, you've reached the limit on your machine. :)
Posted on 2003-07-07 14:19:03 by bitRAKE

Now that you use SSE, you can safely assume that prefetch family of instructions are available. That may help speeding up in a (relatively) long loop. It seems to me that something longer than 128 bytes may benefit from prefetch instructions. (But, this is purely from my experience and not an experiment result. The threshold may be lower or higher if you experiment rigorously.)

For short vectors (like remaining elements after the main loop in your code), it may be faster to use an unrolled loop of FPU instructions. For an example of FPU version of l_2 norm, see this post.

Minor points:
1. The current version may overflow even if the final value is representable in single precision. (And, this is a part of the reason why I'm not a big fan of SSE for numerical analysis.) You know, the usual trick is to divide each element by the maximum element and then accumulate. If memory is much slower than the division operation, the way that BLAS dnrm2 is implemented may be faster.

2. I suppose that sV is not the same as dim in the C version, otherwise you need shl at the beginning of the function.
Posted on 2003-07-07 18:51:44 by Starless
bitRAKE, I think I follow what you're saying. That example you give certainly is faster, down to about 15% & 65% for large and small respectivly. I'll try and get myself a list of those latencies. Thanks,

Starless, I don't really understand the prefetch instruction, I seem to get an illegal instrucion.
Posted on 2003-07-08 17:29:37 by Eóin