In MMX I often use packed shifts to shift RGB values around and such.

However, a packed shift applies the same shift count to all bytes, words, etc.
In SSE or SSE2, is there a shift instruction that shifts each element by a different amount?

I.e., when shifting byte-sized elements, if you passed in

pshllbCOOL mm1, 01020304h

it would shift one byte by 1, another by 2, another by 3 and another by 4.
Is there any such thing in existence? I really don't want to have to separate the channels and shift and mask them separately.
Posted on 2004-06-11 17:58:53 by klumsy
You can do that with the multiply instructions. ;)
Posted on 2004-06-11 18:50:00 by bitRAKE
I think he wants to shift different bytes, not the same byte...
Posted on 2004-06-11 19:08:34 by Sephiroth3
Yes, and packed mul will still allow you to do that. You can specify a separate multiplier for every word.
Posted on 2004-06-11 19:14:38 by Scali
That's an interesting technique and way of thinking about it,
but would multiplication be more expensive than separating the channels, then ANDing, ORing and shifting each one?
Posted on 2004-06-11 20:05:34 by klumsy
A mul is a very fast operation. Downside is that it only works on words, so you need to unpack and repack. But that's probably still a lot faster than processing them all separately.
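
To make the idea concrete, here is a minimal sketch in plain C that models what pmullw does to the four words of one MMX register. It is a scalar model only, not MMX code, and the function names are made up for illustration. Multiplying lane i by 2^counts[i] and keeping the low word is the same as shifting that lane left by counts[i].

```c
#include <stdint.h>

/* Scalar model of pmullw on one 64-bit MMX register: four 16-bit lanes,
   each multiplied by its own factor, keeping the low 16 bits of each product.
   (Hypothetical helper name, for illustration only.) */
void pmullw_model(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)((uint32_t)a[i] * b[i]);
}

/* Shift lane i left by counts[i] (0..15) by multiplying it with 2^counts[i]. */
void shift_left_per_lane(const uint16_t v[4], const unsigned counts[4], uint16_t out[4])
{
    uint16_t factors[4];
    for (int i = 0; i < 4; i++)
        factors[i] = (uint16_t)(1u << counts[i]);
    pmullw_model(v, factors, out);
}
```

In the real routine the factors would sit in a constant quadword, and the bytes would first be unpacked to words (for example with punpcklbw against a zeroed register).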
Posted on 2004-06-12 05:21:37 by Scali
I must be too old school, I remember mul being a slow dog.
So mul will take care of the equivalent of shift left,
but not shift right (unless I mul it by a certain amount (different for each byte) and then shr it by a fixed amount or something).
Posted on 2004-06-12 05:39:14 by klumsy
"I must be too old school, I remember mul being a slow dog"


Integer mul is slow, at least on P4, and on P3/Athlon it is still slower than fmul or pmul/mulps/etc.
But this is not the scalar integer mul, so different rules apply :)
It doesn't hurt to look up or measure timings for instructions if you don't know them. Your assumptions might be wrong, and the assembly you spent all that time on 'optimizing' might turn out slow after all, because you chose the wrong instructions.

The shift is not required: you can do a pmul and take the upper word of the product instead of the lower word. Consider the upper word as the entire dword result, shifted right by 16. So when you multiply by 2^n and keep the upper word, you have shifted right by (16-n). That is effectively a shift right for n < 16.
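
A small scalar sketch of the upper-word trick, again in plain C rather than MMX, with hypothetical function names. It models an unsigned high-word multiply (pmulhuw, which arrived with the SSE integer extensions). Plain MMX pmulhw is signed, so with that instruction the inputs and factors are assumed to stay below 8000h, which rules out a right shift by 1.

```c
#include <stdint.h>

/* Scalar model of an unsigned high-word multiply on four 16-bit lanes:
   each lane keeps bits 16..31 of its 32-bit product. */
void mulhi_u16_model(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (uint16_t)(((uint32_t)a[i] * b[i]) >> 16);
}

/* Shift lane i right by counts[i] (1..16) by multiplying it with
   2^(16 - counts[i]) and keeping the upper word of the product. */
void shift_right_per_lane(const uint16_t v[4], const unsigned counts[4], uint16_t out[4])
{
    uint16_t factors[4];
    for (int i = 0; i < 4; i++)
        factors[i] = (uint16_t)(1u << (16u - counts[i]));
    mulhi_u16_model(v, factors, out);
}
```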
Posted on 2004-06-12 06:53:41 by Scali
Sounds cool, I think I'll have to draw it on paper to get it though. What I am actually doing is a routine that rotates the bits of each colour channel in ARGB.

For my other routine, which rotates the bits through the whole RGB value (rather than within each colour channel, as this one does), I keep two copies of the pixel. I shift one copy right
and the other left, then por them back together. It works well.

movq mm4, shiftleftamount;
movq mm5, shiftrightamount;
movq mm6, getalphamask;
movq mm7, getcolormask;



AllLooper:

movq mm0, ;//original byte (RED Channel)
movq mm1,mm0 ;//mm1 color, mm0 alpha
pand mm0,mm6
pand mm1,mm7 ;//mm1 color to shift left
movq mm2,mm1 ;//mm2 color to shift right
pslld mm1,mm4
psrld mm2,mm5
por mm1,mm2 ;//combine the shifted parts
pand mm1,mm7 ;//drop the bits the left shift pushed into the alpha byte
por mm0,mm1 ;//combine with alpha


//PADDUSB mm0,mm2
//psrld mm0,24

movq ,mm0

add edi, 8
add esi, 8

dec ecx
jnz AllLooper
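
For checking the loop's output, a scalar reference of the same whole-RGB rotate can help. This is only a sketch in plain C with a made-up function name. It assumes alpha sits in the top byte of each dword and that the right shift amount is 24 minus the left shift amount. It masks after combining, so bits shifted past bit 23 never reach the alpha byte.

```c
#include <stdint.h>

/* Rotate the 24 colour bits of one ARGB pixel left by n (1..23),
   leaving the alpha byte untouched. */
uint32_t rotate_rgb24(uint32_t pixel, unsigned n)
{
    uint32_t alpha = pixel & 0xFF000000u;
    uint32_t color = pixel & 0x00FFFFFFu;
    uint32_t left  = color << n;           /* may spill into bits 24..31 */
    uint32_t right = color >> (24u - n);   /* the bits that wrap around */
    uint32_t rotated = (left | right) & 0x00FFFFFFu;
    return alpha | rotated;
}
```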




However, for this one, since there is no byte multiplication, I'll have to process each pixel individually (though of course I can do that inside one loop, with one movq to read and another to write both pixels).

However, since to rotate I have to move the bits in both directions, each pixel would require 2 multiplications (and thus 4 in the inner loop). I wish MMX had a rotate instruction.
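
One multiplication per pixel may be enough, though. Once a byte is unpacked into a word, the product v * 2^n holds the bits that stayed in place in its low byte and the bits that wrapped around in its high byte, so ORing the two bytes gives the rotated channel. A scalar sketch in plain C (function names are hypothetical, and this is a model rather than tested MMX code):

```c
#include <stdint.h>

/* Rotate an 8-bit channel left by n (0..7) using a single 16-bit multiply. */
uint8_t rotl8_by_mul(uint8_t v, unsigned n)
{
    uint16_t product = (uint16_t)(v * (1u << n));   /* what pmullw gives per lane */
    return (uint8_t)((product & 0xFFu) | (product >> 8));
}

/* Rotate the R, G and B channels of an ARGB pixel by separate amounts;
   the alpha byte is kept as it is. */
uint32_t rotate_channels(uint32_t pixel, unsigned nr, unsigned ng, unsigned nb)
{
    uint32_t a = pixel & 0xFF000000u;
    uint32_t r = rotl8_by_mul((uint8_t)(pixel >> 16), nr);
    uint32_t g = rotl8_by_mul((uint8_t)(pixel >> 8), ng);
    uint32_t b = rotl8_by_mul((uint8_t)pixel, nb);
    return a | (r << 16) | (g << 8) | b;
}
```

In MMX terms this would roughly be punpcklbw with zero, pmullw with the per-channel factors (factor 1 for alpha), a copy shifted with psrlw by 8, por, pand with 00FFh in every word, then packuswb.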
Posted on 2004-06-12 19:19:20 by klumsy
movq mm0, ;//= original
movq mm1,mm0 ;//red
pand mm1 ,redmask
movq mm2 ,mm1 ;
pslld mm1 ,redshiftleftamount
psrld mm2 ,redshiftrightamount
por mm1,mm2 ;//mm1 is red
pand mm1, redmask;

movq mm3,mm0
pand mm3,greenmask
movq mm4,mm3
pslld mm3 ,greenshiftleftamount
psrld mm4 ,greenshiftrightamount
por mm3,mm4 ;//mm3 is green
pand mm3,greenmask;

movq mm5,mm0
pand mm5,bluemask
movq mm6,mm5
pslld mm5 ,blueshiftleftamount
psrld mm6 ,blueshiftrightamount
por mm5,mm6 ;//mm5 is blue
pand mm5,bluemask;

pand mm0,alphamask ;//mm0 is alpha
por mm1,mm3 ;red and green combined
por mm0,mm5 ;alpha and blue combined
por mm1,mm0 ;all combined

movq ,mm1



Above is my non-multiplication version, just with shifts (it could be optimised more, I just have it this way for simplicity and readability).
Scali, do you still think that an MMX multiplication solution would be faster? If so, I'll have a go at that.
Also, what tools can I use to measure performance, and are there any really good files/sheets listing how long different instructions take?
Posted on 2004-06-13 21:47:56 by klumsy

"Yes, and packed mul will still allow you to do that. You can specify a separate multiplier for every word."


So you mean, like multiply by 2, 4, 8 etc.? I thought such a thing would be quite slow compared to using shifts.
And what about shifting in the opposite direction, shifting right?
Posted on 2004-06-25 19:28:20 by klumsy
I just did a quick Google search on MMX shifting, because there is no pdiv or anything like that, IIRC from my former Wintel days. Here's what I came up with: http://www.udayton.edu/~cps/cps560/notes/hardware/mmx/Intel/dg_chp2.htm. As for being slow, the pmul instruction can in theory output one chunk of data every clock cycle, but you need to wait three cycles before picking up your result (otherwise you'll cause a stall). I think that's like using the FPU, which I don't use. In fact, IIRC, the MMX registers are aliased onto the FPU registers, so the MMX state needs to be explicitly cleared (emms) if you're using both in the same program, incurring mega-wrath from the Clock Cycle King.

edit: fixed up link. thanks, bitRAKE :)
Posted on 2004-06-25 21:08:18 by jademtech
{FYI: the board puts the period at the end of the sentence into the hyperlink URL which equates to a bad address}
Posted on 2004-06-25 22:23:43 by bitRAKE