Can the following code be optimized at all? It simply adds two 4x4 matrices together.

__m128 row0, row1, row2, row3;
__m128 base0, base1, base2, base3;
__m128 result0, result1, result2, result3;

row0 = _mm_load_ps(m_fMatrix16);
row1 = _mm_load_ps(m_fMatrix16+4);
row2 = _mm_load_ps(m_fMatrix16+8);
row3 = _mm_load_ps(m_fMatrix16+12);

base0 = _mm_load_ps(mat.m_fMatrix16);
base1 = _mm_load_ps(mat.m_fMatrix16+4);
base2 = _mm_load_ps(mat.m_fMatrix16+8);
base3 = _mm_load_ps(mat.m_fMatrix16+12);

result0 = _mm_add_ps(row0, base0);
result1 = _mm_add_ps(row1, base1);
result2 = _mm_add_ps(row2, base2);
result3 = _mm_add_ps(row3, base3);

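_mm_store_ps(matResult.m_fMatrix16, result0);
_mm_store_ps(matResult.m_fMatrix16+4, result1);
_mm_store_ps(matResult.m_fMatrix16+8, result2);
_mm_store_ps(matResult.m_fMatrix16+12, result3);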

Thanks for any help,
Posted on 2006-06-24 21:12:04 by exorcist_bob
Yes, it can, by avoiding the second set of loads.

SSE allows you to add a block of memory directly to an XMM register. In most cases it's slightly faster to do it this way, especially when you only use the values in memory once. Note that the memory operand of addps has to be 16-byte aligned.

movdqa xmm0,      ;row 1
movdqa xmm1, ;row 2
movdqa xmm2, ;row 3
movdqa xmm3, ;row 4
addps xmm0,
addps xmm1,
addps xmm2,
addps xmm3,
movdqa ,xmm0
movdqa ,xmm1
movdqa ,xmm2
movdqa ,xmm3
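
For comparison, here is roughly the same idea with intrinsics instead of inline assembly. This is only a sketch: the function name add4x4 and its pointer parameters are made up for illustration and are not from the code above, and all three arrays are assumed to be 16-byte aligned. The second load is still written out, but an optimizing compiler is free to fold it into the memory operand of addps.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics: _mm_load_ps, _mm_add_ps, _mm_store_ps

// Hypothetical helper (not from the thread): out = a + b for two 4x4 float
// matrices stored as 16 contiguous floats each.
// Assumption: a, b and out are all 16-byte aligned.
inline void add4x4(const float* a, const float* b, float* out)
{
    assert(reinterpret_cast<std::uintptr_t>(a)   % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b)   % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);

    for (int i = 0; i < 16; i += 4) {
        // Load one row of the first matrix into a register.
        __m128 row = _mm_load_ps(a + i);
        // Add the matching row of the second matrix. The compiler may turn
        // this aligned load into a memory operand of addps.
        row = _mm_add_ps(row, _mm_load_ps(b + i));
        // Write the summed row back.
        _mm_store_ps(out + i, row);
    }
}
```

Whether the load actually gets folded depends on the compiler and the optimization settings, so it is worth checking the generated assembly.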

Posted on 2006-06-24 22:09:16 by r22
Now I am getting a syntax error:

error C2415: improper operand type

on the first four movdqa's. Doesn't it know that it can load four floats from that array? :|

Thanks for your help,
Posted on 2006-06-25 07:54:03 by exorcist_bob