Hi everyone!  It's been a while since I've said much here because I've been hard at work on PwnIDE alpha and the video tutorial to go with it.  The tutorial walks through making a very CPU-intensive screen saver, which I'm almost done making, but I ran into a problem... my HP laptop died.  I managed to salvage the files from it using a USB hard drive enclosure, but my old laptop doesn't have SSE3, and that means that I can't do fast and simple things like:

``
mulps   xmm0,xmm4   ;xmm0 = P0x*Q0x,P0y*Q0y,P0z*Q0z,P0a*Q0a
mulps   xmm1,xmm5   ;xmm1 = P1x*Q1x,P1y*Q1y,P1z*Q1z,P1a*Q1a
mulps   xmm2,xmm6   ;xmm2 = P2x*Q2x,P2y*Q2y,P2z*Q2z,P2a*Q2a
mulps   xmm3,xmm7   ;xmm3 = P3x*Q3x,P3y*Q3y,P3z*Q3z,P3a*Q3a
haddps  xmm0,xmm1   ;xmm0 = pairwise partial sums for P0,P1
haddps  xmm2,xmm3   ;xmm2 = pairwise partial sums for P2,P3
haddps  xmm0,xmm2   ;xmm0 = P0.Q0,P1.Q1,P2.Q2,P3.Q3
``

There's always the brute force way of swizzling the data (equivalent to matrix transpose) after the multiplications, then doing 3 addps's, but that's definitely not the most efficient.  Any advice on good ways to approach this type of problem?
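For reference, here's roughly what that brute-force version looks like with plain SSE intrinsics in C. This is only a sketch: the function name `dot4x4_transpose` and the layout (four 4-float vectors packed back to back in `p` and `q`) are assumptions made up for illustration. `_MM_TRANSPOSE4_PS` is the standard macro from `xmmintrin.h`.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only; no SSE3 needed */

/* Hypothetical helper (not from any library): computes four dot products,
   out[i] = P_i . Q_i, where P_i = p[4*i .. 4*i+3] and likewise for Q_i. */
static void dot4x4_transpose(const float *p, const float *q, float *out)
{
    /* Lane-wise products, the same job as the four mulps. */
    __m128 r0 = _mm_mul_ps(_mm_loadu_ps(p + 0),  _mm_loadu_ps(q + 0));
    __m128 r1 = _mm_mul_ps(_mm_loadu_ps(p + 4),  _mm_loadu_ps(q + 4));
    __m128 r2 = _mm_mul_ps(_mm_loadu_ps(p + 8),  _mm_loadu_ps(q + 8));
    __m128 r3 = _mm_mul_ps(_mm_loadu_ps(p + 12), _mm_loadu_ps(q + 12));

    /* Swizzle: after the transpose, r0 holds all the x products,
       r1 all the y products, and so on. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    /* Three addps: lane i ends up holding P_i . Q_i. */
    _mm_storeu_ps(out, _mm_add_ps(_mm_add_ps(r0, r1), _mm_add_ps(r2, r3)));
}
```

The transpose macro expands to eight shuffle-type instructions, which is where the extra cost over the haddps version comes from.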

Not all cases need all 4 values; some only need the first three products to have meaningful values.  I think I've figured out a sufficiently fast way for the cases where I've got just a single dot product to do.  I'll keep the SSE3 versions of the functions in and do a CPUID check to see if the CPU can run them, else it'll run the slower versions.
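A minimal sketch of that kind of CPUID dispatch in C, assuming a reasonably recent GCC or Clang (`__get_cpuid` from `cpuid.h` and the `target` attribute are compiler-specific). The names `has_sse3`, `dot_sse1`, `dot_sse3` and `pick_dot` are made up for illustration, and the SSE1 fallback shown is just one possible single-dot-product routine, not necessarily the one I ended up with.

```c
#include <cpuid.h>      /* GCC/Clang-specific: __get_cpuid */
#include <xmmintrin.h>  /* SSE1 */
#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* SSE3 support is reported in CPUID leaf 1, ECX bit 0. */
static int has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;               /* leaf 1 not available */
    return (ecx & 1u) != 0;
}

typedef float (*dot_fn)(const float *a, const float *b);

/* Fallback: single 4-float dot product using only SSE1. */
static float dot_sse1(const float *a, const float *b)
{
    __m128 m  = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    __m128 hi = _mm_movehl_ps(m, m);           /* (m2, m3, m2, m3)      */
    __m128 s  = _mm_add_ps(m, hi);             /* (m0+m2, m1+m3, ...)   */
    __m128 s1 = _mm_shuffle_ps(s, s, 0x55);    /* broadcast lane 1      */
    return _mm_cvtss_f32(_mm_add_ss(s, s1));   /* lane 0 = full sum     */
}

/* SSE3 version: two horizontal adds. The target attribute lets this one
   function use SSE3 without compiling the whole file with -msse3. */
static __attribute__((target("sse3")))
float dot_sse3(const float *a, const float *b)
{
    __m128 m = _mm_mul_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
    m = _mm_hadd_ps(m, m);
    m = _mm_hadd_ps(m, m);
    return _mm_cvtss_f32(m);
}

/* Pick once at startup, then call through the pointer. */
static dot_fn pick_dot(void)
{
    return has_sse3() ? dot_sse3 : dot_sse1;
}
```

Doing the check once and caching the function pointer keeps the CPUID cost out of the inner loop.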

Here's a screenshot I took before my laptop died, in case people are curious:

(Edit: Removed the img tag, because I attached the image.)
Posted on 2007-12-30 22:02:38 by hackulous
Here's a recent solution, along with a claim that haddps doesn't bring much improvement:
http://www.kvraudio.com/forum/viewtopic.php?p=2827383

For archival purposes, I'll quote the code:

method1)
``
    MOVHLPS     XMM1,XMM0
    ADDPS       XMM0,XMM1
    MOVUPS      XMM1,XMM0
    SHUFPS      XMM1,XMM1,$55
    ADDPS       XMM0,XMM1
``

method2)
The most efficient is probably to group them by chunks of 4:
``
/*  return  */
inline __m128 sum4(__m128 a, __m128 b, __m128 c, __m128 d) {
    /* +        */
    return _mm_add_ps(_mm_unpacklo_ps(s1,s2),_mm_unpackhi_ps( s1,s2));
}
``
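Note that the quoted method2 uses `s1` and `s2` without defining them, so it won't compile as shown. Below is a sketch of one reading that does compile, assuming `s1` and `s2` are the pairwise unpack-and-add results of (a, c) and (b, d); that assumption is a guess at the missing lines, not something taken from the linked thread. The wrapper `dot4x4_unpack` and its packed 16-float layout are also hypothetical, added only to show how this would replace the three haddps.

```c
#include <xmmintrin.h>  /* SSE1 only */

/* Returns (a0+a1+a2+a3, b0+b1+b2+b3, c0+c1+c2+c3, d0+d1+d2+d3). */
static inline __m128 sum4(__m128 a, __m128 b, __m128 c, __m128 d)
{
    /* ASSUMED definitions of s1 and s2 (missing from the quoted code):
       s1 = (a0+a2, c0+c2, a1+a3, c1+c3)
       s2 = (b0+b2, d0+d2, b1+b3, d1+d3) */
    __m128 s1 = _mm_add_ps(_mm_unpacklo_ps(a, c), _mm_unpackhi_ps(a, c));
    __m128 s2 = _mm_add_ps(_mm_unpacklo_ps(b, d), _mm_unpackhi_ps(b, d));

    /* Interleave again so each lane collects one input's full sum. */
    return _mm_add_ps(_mm_unpacklo_ps(s1, s2), _mm_unpackhi_ps(s1, s2));
}

/* Hypothetical wrapper: out[i] = P_i . Q_i for four packed 4-float vectors. */
static void dot4x4_unpack(const float *p, const float *q, float *out)
{
    __m128 r0 = _mm_mul_ps(_mm_loadu_ps(p + 0),  _mm_loadu_ps(q + 0));
    __m128 r1 = _mm_mul_ps(_mm_loadu_ps(p + 4),  _mm_loadu_ps(q + 4));
    __m128 r2 = _mm_mul_ps(_mm_loadu_ps(p + 8),  _mm_loadu_ps(q + 8));
    __m128 r3 = _mm_mul_ps(_mm_loadu_ps(p + 12), _mm_loadu_ps(q + 12));
    _mm_storeu_ps(out, sum4(r0, r1, r2, r3));
}
```

Under that reading the whole reduction costs six unpacks and three adds, compared with eight shuffles and three adds for a full transpose.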

Posted on 2007-12-31 07:16:18 by Ultrano

Thanks!  :D
I've got a more detailed response on the MASM Forum.
Posted on 2007-12-31 12:31:56 by hackulous