Hi,

As I've continued to develop my project (see First Foray into ASM) I've got a bit sidetracked into looking at colour matrices. I've written and optimised the C++ version and whilst I'm pretty happy with the results I'd like to look at implementing an MMX version. Sadly I know very little MMX and have not really found many good sources online. I've got the basic idea, but the wealth of instructions, packing, unpacking etc. has left me a little dizzy.

A colour matrix is simply the application of a matrix to manipulate each pixel in a given image. The original source code for this used floating point (I know I can look into SSE/SSE2 but I'd like to work my way up to these) and as such was pretty slow, though very accurate. I then converted this to an integer version (using the methodology of fixed point, but not caring where the point is ;) - well, the code worked and that's all that was important at the time), which was about 40% faster. I then converted this to a lookup table version, which gave almost a 60% performance gain over the original FP version.

I've posted the fixed point version here to act as a basis for the MMX version, since I believe it's the most appropriate.

    void TStdXtra_IMoaMmXScript::ncp_ColourMatrixImage_FixedPoint(unsigned long* src, unsigned long* dst, MoaUlong uiWidth, MoaUlong uiHeight, MoaDouble* mMat)
    {
        MoaLong iRed, iGreen, iBlue;
        MoaLong ir, ig, ib;
        MoaLong newMat[16];
        MoaUlong ui1, i;
        MoaUlong iImageSize = uiWidth*uiHeight;

        // Convert matrix to fixed point using *256
        for (i=0; i<16; i++)
            newMat[i] = (MoaLong)(mMat[i]*256.0f);

        for (i=0; i<iImageSize; i++)
        {
            // This appears to be the fastest method of grabbing the components
            ui1 = *src++;
            ir = (MoaLong)((ui1 >> 16) & 0xFF);
            ig = (MoaLong)((ui1 >> 8) & 0xFF);
            ib = (MoaLong)((ui1) & 0xFF);

            // Use fixed point matrix values - have to divide through at end
            iRed   = (ir*newMat[0] + ig*newMat[4] + ib*newMat[8]  + newMat[12]) / 256;
            iGreen = (ir*newMat[1] + ig*newMat[5] + ib*newMat[9]  + newMat[13]) / 256;
            iBlue  = (ir*newMat[2] + ig*newMat[6] + ib*newMat[10] + newMat[14]) / 256;

            // bound checks - yuk! C < 0 = 0, C > 255 = 255
            // < snipped for shorter code and it should be removed by using MMX >

            *dst++ = (unsigned long)( (byte)(iBlue) | ((byte)(iGreen) << 8) | ((byte)(iRed) << 16));
        }
    }
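The bounds checks were snipped from the listing, so the following is only a minimal sketch of the kind of clamp the comment describes (below 0 becomes 0, above 255 becomes 255). The name `clampToByte` is hypothetical and is not taken from the original code.

```cpp
#include <cstdint>

// Hypothetical helper (not from the original listing): clamp a fixed-point
// result to the 0..255 range of an 8-bit colour component.
inline std::uint8_t clampToByte(long v)
{
    if (v < 0)   return 0;    // C < 0  -> 0
    if (v > 255) return 255;  // C > 255 -> 255
    return static_cast<std::uint8_t>(v);
}
```

With a helper like this, each component would be clamped before being shifted into place, for example `clampToByte(iRed) << 16`.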

So from my understanding of MMX so far, I can set up the maths as such

    // MMX Words          a       c       e       g
    //                    *       *       *       *
    //     Words          b       d       f       h
    //     Result dWord   a*b+c*d         e*f+g*h
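To make the layout above concrete, here is a portable scalar model of the MMX multiply-add (the PMADDWD instruction). It is not MMX code, only a sketch of the arithmetic: four 16-bit products are summed pairwise into two 32-bit results. The name `pmaddwdModel` is hypothetical.

```cpp
#include <cstdint>

// Portable model of the MMX multiply-add (PMADDWD). x holds the words
// a, c, e, g and y holds b, d, f, h from the diagram. This sketch ignores
// the single wrap-around corner case where all inputs are -32768.
inline void pmaddwdModel(const std::int16_t x[4], const std::int16_t y[4],
                         std::int32_t out[2])
{
    out[0] = std::int32_t(x[0]) * y[0] + std::int32_t(x[1]) * y[1];  // a*b + c*d
    out[1] = std::int32_t(x[2]) * y[2] + std::int32_t(x[3]) * y[3];  // e*f + g*h
}
```

The two dword results still have to be added together afterwards to get one dot product; the multiply-add only does the first half of that sum.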

Which in the case of the red component resolves to

    // Load mm1 with matrix       m[0]   m[4]   m[8]   m[12]
    // Load mm2 with components   ir     ig     ib     1
    // Multiply mm1 with mm2
    // Results                    m[0]*ir + m[4]*ig    m[8]*ib + m[12]*1

For the moment I'm ignoring alpha, so m[12]*1 represents the translation of the colour component. At some stage I'll introduce alpha and do the component translation later.

So the question is: how do I go about this with real MMX code?

Ideally I'd use the VC2005 intrinsics header so I can use MMX without going to full asm; I can move to pure asm afterwards.

I'm guessing I need to use shorts for the matrix and colour component values and then pack those into MMX registers, then use MMX to do the multiply and add. Should that include saturation at this point, since I need to add the two dword results together to get the final new component colour?

I need to then extract the result and place it into a byte in the destination.
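A minimal sketch of how these steps might look with the intrinsics in `<mmintrin.h>`, for one pixel. It assumes the matrix has already been converted to 8.8 fixed point and stored as signed 16-bit values in the original layout, so every scaled entry (including the translation terms) must fit in a short. The names `sumAndScale` and `transformPixelMMX` are hypothetical. The multiply-add itself has no saturating form; in this sketch the saturation comes from the two pack steps at the end, which also stand in for the bounds checks. It needs an x86 target with MMX support (MSVC does not provide MMX intrinsics for x64 builds).

```cpp
#include <mmintrin.h>  // MMX intrinsics; x86 targets only
#include <cstdint>

// Add the two dwords produced by a multiply-add and drop the 8 fraction
// bits. The finished value is left in the low dword.
static inline __m64 sumAndScale(__m64 v)
{
    const __m64 sum = _mm_add_pi32(v, _mm_srli_si64(v, 32));
    return _mm_srai_pi32(sum, 8);
}

// Hypothetical helper: transform one 0x00RRGGBB pixel. m is the matrix in
// 8.8 fixed point, laid out as in the listing (red uses m[0], m[4], m[8], m[12]).
inline std::uint32_t transformPixelMMX(std::uint32_t pixel, const std::int16_t m[16])
{
    const short ir = static_cast<short>((pixel >> 16) & 0xFF);
    const short ig = static_cast<short>((pixel >> 8) & 0xFF);
    const short ib = static_cast<short>(pixel & 0xFF);

    // _mm_set_pi16 takes its arguments from the highest word to the lowest,
    // so this register holds ir, ig, ib, 1 from low to high.
    const __m64 comps = _mm_set_pi16(1, ib, ig, ir);

    const __m64 red   = sumAndScale(_mm_madd_pi16(comps, _mm_set_pi16(m[12], m[8],  m[4], m[0])));
    const __m64 green = sumAndScale(_mm_madd_pi16(comps, _mm_set_pi16(m[13], m[9],  m[5], m[1])));
    const __m64 blue  = sumAndScale(_mm_madd_pi16(comps, _mm_set_pi16(m[14], m[10], m[6], m[2])));

    // Gather the low dwords as blue, green, red, 0, then pack with
    // saturation: dwords -> signed words -> unsigned bytes (0..255).
    const __m64 bg    = _mm_unpacklo_pi32(blue, green);
    const __m64 r0    = _mm_unpacklo_pi32(red, _mm_setzero_si64());
    const __m64 words = _mm_packs_pi32(bg, r0);
    const __m64 bytes = _mm_packs_pu16(words, words);

    const std::uint32_t result = static_cast<std::uint32_t>(_mm_cvtsi64_si32(bytes));
    _mm_empty();  // leave MMX state so later floating point code is safe
    return result;
}
```

In a real loop the three matrix registers would be built once outside the loop and `_mm_empty()` called once after it; they are kept inside the function here only to keep the sketch self-contained.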

Just looking for some pointers on how to start this, thanks.

Oh, do you think there is any benefit in re-arranging the matrix from row to column order? That way the matrix access for each component will be sequential, as in red = m[0] m[1] m[2] m[3] instead of m[0] m[4] m[8] m[12]. I don't think it will affect the ASM, as VC2005 uses offsets, but perhaps it's better for the cache?
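As a sketch of the re-arrangement being asked about, the conversion to fixed point and the transpose could be done in one pass, so that the four coefficients for each output component end up next to each other as shorts. The name `transposeToFixed16` is hypothetical, and the sketch assumes every scaled value fits in a signed 16-bit short.

```cpp
#include <cstdint>

// Hypothetical helper: convert the 4x4 matrix to 8.8 fixed point and
// transpose it so each output component's coefficients are contiguous:
// dst[0..3] = m[0], m[4], m[8], m[12] (red), dst[4..7] = green, and so on.
inline void transposeToFixed16(const double src[16], std::int16_t dst[16])
{
    for (int comp = 0; comp < 4; ++comp)
        for (int k = 0; k < 4; ++k)
            dst[comp * 4 + k] = static_cast<std::int16_t>(src[k * 4 + comp] * 256.0);
}
```

A 16-entry matrix of shorts is only 32 bytes, so it should sit in cache either way; the more likely gain from this layout is that each component's four coefficients can be fetched with a single 64-bit load.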
