Hi everyone!  It's been a while since I've said much here because I've been hard at work on the PwnIDE alpha and the video tutorial to go with it.  The tutorial walks through making a very CPU-intensive screen saver, which I'm almost done making, but I ran into a problem... my HP laptop died.  I managed to salvage the files from it using a USB hard drive enclosure, but my old laptop doesn't support SSE3, which means I can't do fast and simple things like:


mulps   xmm0,xmm4   ;xmm0 = P0x*Q0x,P0y*Q0y,P0z*Q0z,P0a*Q0a
mulps   xmm1,xmm5   ;xmm1 = P1x*Q1x,P1y*Q1y,P1z*Q1z,P1a*Q1a
mulps   xmm2,xmm6   ;xmm2 = P2x*Q2x,P2y*Q2y,P2z*Q2z,P2a*Q2a
mulps   xmm3,xmm7   ;xmm3 = P3x*Q3x,P3y*Q3y,P3z*Q3z,P3a*Q3a
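haddps  xmm0,xmm1   ;xmm0 = pair sums of P0*Q0 (lanes 0-1) and P1*Q1 (lanes 2-3)
haddps  xmm2,xmm3   ;xmm2 = pair sums of P2*Q2 (lanes 0-1) and P3*Q3 (lanes 2-3)
haddps  xmm0,xmm2   ;xmm0 = P0.Q0,P1.Q1,P2.Q2,P3.Q3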


There's always the brute-force way of swizzling the data (equivalent to a matrix transpose) after the multiplications and then doing three addps instructions, but that's definitely not the most efficient.  Any advice on good ways to approach this type of problem?
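For reference, the brute-force version might look something like this in C intrinsics. It's only a sketch: the name dot4x4_sse1 and the array-of-four-vectors interface are made up for illustration, and it leans on the _MM_TRANSPOSE4_PS macro from xmmintrin.h, which uses SSE1 instructions only.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics only, no haddps needed */

/* Hypothetical helper: four 4-component dot products at once.
   Lane i of the result holds p[i] . q[i]. */
static __m128 dot4x4_sse1(const __m128 p[4], const __m128 q[4])
{
    /* Component-wise products, one register per vector pair. */
    __m128 r0 = _mm_mul_ps(p[0], q[0]);
    __m128 r1 = _mm_mul_ps(p[1], q[1]);
    __m128 r2 = _mm_mul_ps(p[2], q[2]);
    __m128 r3 = _mm_mul_ps(p[3], q[3]);

    /* After the transpose, r0 holds the four x products,
       r1 the four y products, and so on. */
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);

    /* Three addps finish the four sums. */
    return _mm_add_ps(_mm_add_ps(r0, r1), _mm_add_ps(r2, r3));
}
```

The transpose typically costs eight shuffle-type instructions on top of the four mulps and three addps, which is why it feels heavy next to the three haddps in the SSE3 version.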

Not all cases need all 4 values; some only need the first three products to have meaningful values.  I think I've figured out a sufficiently fast way for the cases where I've got just a single dot product to do.  I'll keep the SSE3 versions of the functions in and do a CPUID check to see if the CPU can run them, else it'll run the slower versions.
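Roughly, the CPUID check and a three-component fallback could be sketched like this in C. The names has_sse3 and dot3_sse1 are made up for illustration, and the cpuid.h header with __get_cpuid is a GCC/Clang assumption (other compilers expose CPUID differently). The SSE3 flag is bit 0 of ECX from CPUID leaf 1.

```c
#include <cpuid.h>      /* GCC/Clang wrapper around the CPUID instruction (assumption) */
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: returns 1 if CPUID leaf 1 reports SSE3 (ECX bit 0), else 0. */
static int has_sse3(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf 1 not available */
    return (int)(ecx & 1u);
}

/* Hypothetical SSE1-only fallback for a single dot product where only
   the first three products are meaningful; lane 3 is ignored. */
static float dot3_sse1(__m128 p, __m128 q)
{
    __m128 m = _mm_mul_ps(p, q);            /* [x, y, z, junk] products */
    __m128 y = _mm_shuffle_ps(m, m, 0x55);  /* y product in every lane */
    __m128 z = _mm_movehl_ps(m, m);         /* z product in lane 0 */
    return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(m, y), z));
}
```

One way to use it would be to call has_sse3() once at startup and store a function pointer to either the SSE3 routine or the fallback, so the check isn't repeated in the inner loop.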

Here's a screenshot I took before my laptop died, in case people are curious:

(Edit: Removed the img tag, because I attached the image.)
Posted on 2007-12-30 22:02:38 by hackulous
Here's a recent solution, along with a statement that haddps doesn't bring much improvement:
http://www.kvraudio.com/forum/viewtopic.php?p=2827383

For archival purposes, I'll quote the code:

method1)

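    MOVHLPS    XMM1,XMM0       // XMM1 low half = XMM0 high half (a2,a3)
    ADDPS      XMM0,XMM1       // XMM0 lanes 0,1 = a0+a2, a1+a3
    MOVUPS     XMM1,XMM0       // copy the partial sums
    SHUFPS     XMM1,XMM1,$55   // broadcast lane 1 (a1+a3)
    ADDPS      XMM0,XMM1       // lane 0 of XMM0 = a0+a1+a2+a3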
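In intrinsics form, the same five instructions might look roughly like this. It's a sketch, not part of the quoted post, and the name hsum_sse1 is made up for illustration.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Hypothetical helper: sum of the four lanes of v, SSE1 only. */
static float hsum_sse1(__m128 v)
{
    __m128 t = _mm_movehl_ps(v, v);   /* MOVHLPS: [v2, v3, v2, v3] */
    v = _mm_add_ps(v, t);             /* ADDPS: lanes 0,1 = v0+v2, v1+v3 */
    t = _mm_shuffle_ps(v, v, 0x55);   /* copy + SHUFPS $55: lane 1 broadcast */
    v = _mm_add_ps(v, t);             /* ADDPS: lane 0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(v);          /* extract lane 0 */
}
```

For a single dot product, the argument would be the result of one mulps, for example hsum_sse1(_mm_mul_ps(p, q)).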


method2)

The most efficient is probably to group them by chunks of 4:

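#include <xmmintrin.h>
/* return [sum(a), sum(b), sum(c), sum(d)] */
inline __m128 sum4(__m128 a, __m128 b, __m128 c, __m128 d) {
    /* s1 and s2 were missing from the quoted snippet; reconstructed from the final line */
    __m128 s1 = _mm_add_ps(_mm_unpacklo_ps(a,c),_mm_unpackhi_ps(a,c)); /* [a0+a2 c0+c2 a1+a3 c1+c3] */
    __m128 s2 = _mm_add_ps(_mm_unpacklo_ps(b,d),_mm_unpackhi_ps(b,d)); /* [b0+b2 d0+d2 b1+b3 d1+d3] */
    /* [a0+a2 b0+b2 c0+c2 d0+d2] +
       [a1+a3 b1+b3 c1+c3 d1+d3] */
    return _mm_add_ps(_mm_unpacklo_ps(s1,s2),_mm_unpackhi_ps(s1,s2));
}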

Posted on 2007-12-31 07:16:18 by Ultrano

Thanks!  :D
I've got a more detailed response on the MASM Forum.
Posted on 2007-12-31 12:31:56 by hackulous