In some graphics code I am writing, I have functions that rearrange the RGB channels of a pixel (32-bit ARGB, where A doesn't matter at all).
I plan on writing 27 routines, one for each combination of RGB (RRR, GBR, etc.), so that each can be as fast as possible. (If somebody can tell me of an elegant way that wouldn't require so much code, go ahead please :) )

Anyhow, my first (non-MMX) attempt at converting one pixel was this (0RGB to 0RRR):

mov eax,
mov ebx,eax
shr ebx,16 ; ebx now = 0/0/A/R/
mov bh,bl ;ebx now = 0/0/R/R
xor ax,ax ; strip out the green and blue (any faster way?)
or eax,ebx ;combine to make /A/R/R/R
mov , eax

I don't know if that is the most optimised; I'd be interested in people's perspective. Also, are single-byte moves like mov bh,bl a slow technique on modern processors, or is that fine?

And my first attempt at it in MMX is:

mov ecx, numpixels
shr ecx, 1 ;//since we process 2 pixels at a time
mov esi, pSource
mov edi, pDest


movd mm0, redmask
movq mm2,mm0
psllq mm2, 32
por mm2,mm0 ; //double up the mask into 64 bits so it can be applied to 2 pixels at a time
//MM2 contains mask 00FF0000 00FF0000 (when ANDed with a pixel pair, it retains just the red)
loopperRRR:
movq mm0, [esi] ; //grab the 2 pixels ARGB ARGB
pand mm0,mm2 ; //now mm0 contains 2 pixels with just the red channel, so 0R00 0R00
movq mm1,mm0 ; //mm1 contains a copy of that
psrld mm1,8 ; //move the red in both pixels into the green, so mm1 contains 00R0 00R0
por mm1,mm0 ; //combine with original, so mm1 contains 0RR0 0RR0
psrld mm0,16 ; //take original and shift red into the blue position, so mm0 is 000R 000R
por mm1,mm0 ; //combine for our result of 0RRR 0RRR

//pxor mm1,mm1
movq [edi],mm1
add edi, 8
add esi, 8

dec ecx
jnz loopperRRR

mov ecx, numpixels
and ecx, 1 ;//we process 2 pixels at a time, so if the count is odd we still have to do the last pixel
//otherwise we skip it

jz skipOddpixelRRR
mov eax, [esi]
mov ebx,eax
shr ebx,16
mov bh,bl
xor ax,ax
or eax,ebx
mov [edi], eax

skipOddpixelRRR:
emms ;//must run on both paths, before any later floating-point code
}
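
For checking the MMX output, the loop above can be modelled in portable C++, with one 64-bit integer standing in for an MMX register holding two pixels. This is only a sketch: the function name rgb_to_rrr_buffer is made up for illustration, and it clears the top byte of every pixel (as the masked MMX path does), whereas the assembly tail for the odd pixel keeps the top byte.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference model (name not from the thread): 0RGB -> 0RRR,
// two pixels per 64-bit word, plus a tail for an odd pixel count.
void rgb_to_rrr_buffer(const std::uint32_t* src, std::uint32_t* dst, std::size_t count) {
    const std::uint64_t redMask = 0x00FF000000FF0000ULL;  // red byte of both pixels
    std::size_t i = 0;
    for (; i + 1 < count; i += 2) {
        // Pack two pixels the way movq would load them (first pixel in the low half).
        std::uint64_t two = std::uint64_t(src[i]) | (std::uint64_t(src[i + 1]) << 32);
        std::uint64_t red = two & redMask;                    // 0R00 0R00
        // After masking, shifts of 8 and 16 cannot cross the 32-bit boundary.
        std::uint64_t out = red | (red >> 8) | (red >> 16);   // 0RRR 0RRR
        dst[i]     = std::uint32_t(out);
        dst[i + 1] = std::uint32_t(out >> 32);
    }
    if (i < count) {                                          // odd count: last pixel
        std::uint32_t red = src[i] & 0x00FF0000u;
        dst[i] = red | (red >> 8) | (red >> 16);
    }
}
```

Running the same buffer through this and through the assembly routine, then comparing the results with the top byte ignored, is one way to catch mistakes in the shifts or masks.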
Posted on 2004-06-02 00:41:36 by klumsy
are single-byte moves like mov bh,bl a slow technique on modern processors, or is that fine?


Yes, they can be slow. If you write to a partial register, like bh here, the CPU internally allocates a new register for it. If you then read a larger version of it, bx or ebx, the CPU has to combine that new register with the old one, which stalls the pipeline (a partial register stall), so there is a rather large penalty.
Posted on 2004-06-02 01:47:11 by Scali
So would it actually be quicker to use more instructions, i.e. to copy the whole ebx register to another register, shift it, and then combine them again, rather than doing the mov bh,bl?
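
A minimal C++ sketch of that whole-register idea, assuming the same 0RGB to 0RRR conversion as the assembly above. The helper name rgb_to_rrr is invented for illustration. It uses only full 32-bit operations, so a compiler has no reason to write bh or bl; whether it actually beats the mov bh,bl form would have to be measured on the target CPU.

```cpp
#include <cstdint>

// Hypothetical helper (name not from the thread): 0RGB -> 0RRR using only
// full 32-bit operations, so no partial register is written.
// The top byte is passed through unchanged, as in the assembly version.
std::uint32_t rgb_to_rrr(std::uint32_t pixel) {
    std::uint32_t red = (pixel >> 16) & 0xFFu;         // isolate R in the low byte
    // red * 0x010101 equals (red << 16) | (red << 8) | red for an 8-bit value.
    return (pixel & 0xFF000000u) | (red * 0x010101u);
}
```

For example, rgb_to_rrr(0x12345678) gives 0x12343434, the same value the seven-instruction sequence leaves in eax.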

What about the MMX version?

movq mm0, [esi] ; //grab the 2 pixels ARGB ARGB
pand mm0,mm2 ; //now mm0 contains 2 pixels with just the red channel, so 0R00 0R00
movq mm1,mm0 ; //mm1 contains a copy of that
psrld mm1,8 ; //move the red in both pixels into the green, so mm1 contains 00R0 00R0
por mm1,mm0 ; //combine with original, so mm1 contains 0RR0 0RR0
psrld mm0,16 ; //take original and shift red into the blue position, so mm0 is 000R 000R
por mm1,mm0 ; //combine for our result of 0RRR 0RRR
//pxor mm1,mm1
movq [edi],mm1

Do you think there would be a faster way, maybe with some of those funky pack and unpack instructions? (They give me headaches.) I can't see uses for them other than packing and unpacking 15/16-bit graphics, which I don't care about since I'm working fully in 32-bit.
Posted on 2004-06-02 17:31:13 by klumsy
A couple of ideas:

MMX


mask dq 00ff000000ff0000h

movq mm0,[esi] ; mm0 <- 00RRGGBB 00RRGGBB
pand mm0,mask ; mm0 <- 00RR0000 00RR0000
movq mm1,mm0
psrld mm1,8 ; mm1 <- 0000RR00 0000RR00
por mm0,mm1 ; mm0 <- 00RRRR00 00RRRR00
psrld mm1,8 ; mm1 <- 000000RR 000000RR
por mm0,mm1 ; mm0 <- 00RRRRRR 00RRRRRR


I don't think pack/unpack is any use here but maybe it can be done..

SSE2 version:




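align 16
mask dq 00ffffff00ffffffh ; keeps R, G and B, clears the top byte
dq 00ffffff00ffffffh

movdqa xmm0,[esi] ; change to movdqu if data isn't aligned on 16-byte boundary
pshuflw xmm1,xmm0,11110101b ; xmm1 lo <- 00RR 00RR for each of its two pixels (hi copied from xmm0)
pshufhw xmm1,xmm1,11110101b ; xmm1 hi <- 00RR 00RR for each of its two pixels (lo kept)
movdqa xmm0,xmm1
psllw xmm0,8 ; xmm0 <- RR00 in every word
por xmm0,xmm1 ; xmm0 <- RRRR in every word (assumes the top byte of each source pixel is 0)
pand xmm0,mask ; could be removed if you don't care about the top byte not being 0
movntdq [edi],xmm0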


Neither has been tested but should be ok..
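
Since the shuffle immediates are easy to get wrong, here is a small portable C++ model of how pshuflw/pshufhw pick words within one 64-bit half: result word i is the source word selected by bits 2i+1..2i of the immediate. The helper name shuffle_words is made up for illustration. With 00RRGGBB pixels, each pixel's 00RR word is the odd-numbered word, so 11110101b (0xF5) copies each pixel's own red word, while 11111111b (0xFF) copies word 3 into all four positions.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model (name not from the thread) of the word selection that
// pshuflw/pshufhw perform on one 64-bit half of an XMM register.
// Result word i is source word ((imm >> (2 * i)) & 3).
std::array<std::uint16_t, 4> shuffle_words(const std::array<std::uint16_t, 4>& src,
                                           unsigned imm) {
    std::array<std::uint16_t, 4> out{};
    for (unsigned i = 0; i < 4; ++i) {
        out[i] = src[(imm >> (2 * i)) & 3u];
    }
    return out;
}
```

Feeding it the four words of two pixels, for instance 0x00112233 and 0x00445566 stored as {0x2233, 0x0011, 0x5566, 0x0044}, shows directly which red value ends up in which word for a given immediate.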
Posted on 2004-06-02 19:41:36 by stormix
Thanks for the advice. Do you know how to declare a variable as dq in MSVC++ (like you do with your mask)? I had my code double it up because I couldn't work out how to do that.
In your first example, would it be any more efficient than my code? Just curious; I see the only difference is shifting by 8 both times rather than by 16 on one.

I've been reading up today about SSE and SSE2 and have been thinking it's the way to go.
Also, with those SSE instructions I wouldn't have to write 27 different optimised routines for the different RGB combinations, but could rather just change the shuffle masks, which is a huge advantage.
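
On the dq question, a hedged sketch of how a 64-bit constant can be declared on the C++ side. The names kRedMask64 and kRedMask128 are invented here. The code uses standard C++11 spellings; with older MSVC versions the equivalent spellings would be unsigned __int64 for the type and __declspec(align(16)) for the alignment.

```cpp
#include <cstdint>

// C++ counterpart of "mask dq 00ff000000ff0000h": one 64-bit constant that
// covers the red byte of two 32-bit pixels. (Name is hypothetical.)
static const std::uint64_t kRedMask64 = 0x00FF000000FF0000ULL;

// Two copies with 16-byte alignment, suitable as an aligned 128-bit memory
// operand for SSE2 instructions such as pand. (Name is hypothetical.)
alignas(16) static const std::uint64_t kRedMask128[2] = {
    0x00FF000000FF0000ULL, 0x00FF000000FF0000ULL
};
```

Inside an MSVC __asm block such a variable can then be named directly as a memory operand, the way redmask is used with movq in the code later in this thread.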
Posted on 2004-06-02 21:40:30 by klumsy
For the MMX version, I have decided to give up on writing 27 different routines optimised for each combination, and will rather try to write one routine that can rearrange to any combination.

My idea is that for each channel I will shift the input left or right by some amount, apply a mask to filter out just that channel, and then use OR to combine the 3 channels. The shift amounts can be calculated outside the loop, so that is no problem.

so the RedMask would be 00FF000000FF0000
so the GreenMask would be 0000FF000000FF00
so the BlueMask would be 000000FF000000FF

For the red channel, I'd take the input and:
if the source is red, shift left by 0
if the source is green, shift left by 8
if the source is blue, shift left by 16

so those amounts would be precalculated outside of the loop called something like

REDSHIFTLEFTAMOUNT = (above logic)

For the green channel, the input sometimes has to be shifted left and sometimes right, so I think I have to do both.

So if the source is green, then
GREENSHIFTLEFTAMOUNT = 0, GREENSHIFTRIGHTAMOUNT = 0
if red
GREENSHIFTLEFTAMOUNT = 0, GREENSHIFTRIGHTAMOUNT = 8
if blue
GREENSHIFTLEFTAMOUNT = 8, GREENSHIFTRIGHTAMOUNT = 0

and for blue position
if red
BLUESHIFTRIGHTAMOUNT = 16
if green
BLUESHIFTRIGHTAMOUNT = 8
if blue
BLUESHIFTRIGHTAMOUNT = 0


then inside the loop you would do something like

redchannel = (original << REDSHIFTLEFTAMOUNT) & REDMASK
bluechannel = (original >> BLUESHIFTRIGHTAMOUNT) & BLUEMASK
greenchannel = ((original << GREENSHIFTLEFTAMOUNT) >> GREENSHIFTRIGHTAMOUNT) & GREENMASK
endpixel = redchannel | bluechannel | greenchannel
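
The scheme above can be sketched for a single pixel in C++ as follows. This is an illustration under assumptions: the names Channel, Shifts, make_shifts and rearrange are invented, and the channel numbering (1 = red, 2 = green, 3 = blue) is taken from the shift formulas in the implementation that follows in this thread.

```cpp
#include <cstdint>

// Assumed numbering of source channels: 1 = red, 2 = green, 3 = blue.
enum Channel { Red = 1, Green = 2, Blue = 3 };

// Hypothetical holder for the amounts that are precalculated outside the loop.
struct Shifts {
    int redLeft, greenLeft, greenRight, blueRight;
};

// Each argument names the source channel that should end up in that position.
Shifts make_shifts(Channel forRed, Channel forGreen, Channel forBlue) {
    Shifts s;
    s.redLeft    = (forRed - 1) * 8;            // red 0, green 8, blue 16
    s.greenLeft  = (forGreen == Blue) ? 8 : 0;  // blue must move up into green
    s.greenRight = (forGreen == Red) ? 8 : 0;   // red must move down into green
    s.blueRight  = (3 - forBlue) * 8;           // red 16, green 8, blue 0
    return s;
}

// Shift the wanted source into place, mask it, and OR the three parts together.
std::uint32_t rearrange(std::uint32_t p, const Shifts& s) {
    std::uint32_t r = (p << s.redLeft) & 0x00FF0000u;
    std::uint32_t g = ((p << s.greenLeft) >> s.greenRight) & 0x0000FF00u;
    std::uint32_t b = (p >> s.blueRight) & 0x000000FFu;
    return r | g | b;
}
```

With p = 0x00112233, make_shifts(Blue, Green, Red) swaps red and blue and make_shifts(Red, Red, Red) gives the 0RRR case, so one routine covers all 27 combinations.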

now i go and try to implement it in MMX... man i love the look of that SSE2 though
Posted on 2004-06-02 22:53:38 by klumsy
here is my first attempt at implementing the algorithm above

__int64 redleftshiftamount = (((int)m_fRedChannel)-1)* 8; // so red 0, green 8, blue 16
__int64 bluerightshiftamount = (3 -(int)m_fBlueChannel) * 8;
__int64 greenleftshiftamount = 0;
__int64 greenrightshiftamount = 0;
if (m_fGreenChannel == 1) greenrightshiftamount = 8;
if (m_fGreenChannel == 3) greenleftshiftamount = 8;
__int64 redmask = 0x00FF000000FF0000;
__int64 greenmask = 0x0000FF000000FF00;
__int64 bluemask = 0x000000FF000000FF;

pDest = (DWORD*)pOutput->GetBuffer();

__asm {

mov esi, pSource
mov edi, pDest

mov ecx, numpixels
shr ecx, 1

movq mm5, redmask ;// mm5 = redmask
movq mm6, greenmask ;// mm6 = greenmask;
movq mm7, bluemask;//mm7
movq mm3,greenleftshiftamount;//green left shift amount
movq mm4,greenrightshiftamount;//green right shift amount


AllLooper:

movq mm0, [esi] ;//original pixel pair (becomes the red channel)
movq mm1,mm0 ;//mm1 = copy for the green channel
pslld mm0,redleftshiftamount ;//move whatever channel should be red into the red position
movq mm2,mm1 ;//mm2 = copy for the blue channel
pslld mm1,mm3 ;//first half of moving into the green position
pand mm0,mm5 ;//mm0 is now the complete red channel (redmask has been applied)
psrld mm1,mm4 ;//the other half of moving into the green position
psrld mm2,bluerightshiftamount ;//move whatever channel should be blue into the blue position
pand mm1,mm6 ;//mm1 is now the complete green channel
pand mm2,mm7 ;//mm2 is now the complete blue channel
por mm0,mm1
por mm0,mm2
movq [edi],mm0

add edi, 8
add esi, 8

dec ecx
jnz AllLooper
emms
}
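
As a cross-check for the loop above, here is a portable C++ sketch that processes two pixels per 64-bit word, the way one MMX register does, and also handles an odd pixel count. The name rearrange_buffer and the plain int channel codes (1 = red, 2 = green, 3 = blue, as implied by the shift formulas) are assumptions made for illustration. It uses 64-bit shifts rather than per-dword ones; since no shift exceeds 16 bits, any bits that cross the pixel boundary fall outside the masks and are discarded.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reference (name not from the thread). redSrc/greenSrc/blueSrc
// say which source channel (1 = red, 2 = green, 3 = blue) feeds each position.
void rearrange_buffer(const std::uint32_t* src, std::uint32_t* dst, std::size_t count,
                      int redSrc, int greenSrc, int blueSrc) {
    // Shift amounts are computed once, outside the loop.
    const int redLeft    = (redSrc - 1) * 8;
    const int greenLeft  = (greenSrc == 3) ? 8 : 0;
    const int greenRight = (greenSrc == 1) ? 8 : 0;
    const int blueRight  = (3 - blueSrc) * 8;
    const std::uint64_t redMask   = 0x00FF000000FF0000ULL;
    const std::uint64_t greenMask = 0x0000FF000000FF00ULL;
    const std::uint64_t blueMask  = 0x000000FF000000FFULL;

    // Same shift, mask and OR sequence as the MMX loop, applied to a pixel pair.
    auto mix = [&](std::uint64_t v) {
        return ((v << redLeft) & redMask)
             | (((v << greenLeft) >> greenRight) & greenMask)
             | ((v >> blueRight) & blueMask);
    };

    std::size_t i = 0;
    for (; i + 1 < count; i += 2) {
        std::uint64_t two = std::uint64_t(src[i]) | (std::uint64_t(src[i + 1]) << 32);
        std::uint64_t out = mix(two);
        dst[i]     = std::uint32_t(out);
        dst[i + 1] = std::uint32_t(out >> 32);
    }
    if (i < count) {                       // odd count: last pixel on its own
        dst[i] = std::uint32_t(mix(src[i]));
    }
}
```

Comparing its output with the assembly routine over a test image is a quick way to find out whether a change to the shifts still produces the right results.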

I don't know where to start optimising this. I tried putting bluerightshiftamount into eax and doing psrld mm2,eax, but it didn't produce the right results. I'm specifically targeting the P2 and P3 with this (I'll probably make an SSE/SSE2 one for newer CPUs), but for now it also has to run on the P4. I tried to arrange the instructions the way I would have in the olden days, but I think that may be pointless, as newer Pentiums will do that anyway?
Posted on 2004-06-03 03:01:15 by klumsy