I wrote the following ASM/MMX code to convert 16-bit RGB565 into 24-bit RGB888.  After running this code I have realized there is something about the MOVQ command that I don't fully understand.  The original implementation of this routine was missing the "psrlq      mm0,16" line of code, and some of my color channels were getting obliterated.  When I include the "psrlq      mm0,16" line my color channel integrity is maintained, but my image gets slid to the left a pixel or two (hard to tell which just by visual inspection).  Before the "psrlq      mm0,16" the image always came out looking rather red.  I guess what I really need to know is what is happening (absolutely) when I "movq ,mm0".

For instance, if edi = 100 and edx = 0, where, exactly, do the 64 bits from mm0 end up in memory?  I was assuming that the mm0 register would fill memory slots 100-108, but now I'm starting to guess that it's actually more like 92-100 (or something).

Any insight you can provide will be appreciated.

-Dave/Esotic


//dz| code in question
CQWORD mask_r= 0xf800f800f800f800;
CQWORD mask_g= 0x07e007e007e007e0;
CQWORD mask_b= 0x001f001f001f001f;
CQWORD mask_1= 0x00ff000000000000;
CQWORD mask_2= 0x000000ff00000000;
CQWORD mask_3= 0xffffffffffff0000;
CQWORD mask_4= 0xffffffffffffffff;

int XLOOP = XRES / 2;

_asm{
mov esi,pCurrentSource //esi <- Pointer to RGB565 Memory
mov edi,pCurrentOutput //edi <- Pointer to RGB888 Memory
xor ebx,ebx //ebx <- 0 (ebx ^ ebx)
xor        edx,edx
mov ecx,XLOOP //ecx <- X resolution / 2, as we're doing 2 24bit pixels at a time
RGB565TO888_X:
movq mm0, //mm0 <- source
movq mm3,
movq        mm1,mm0
movq        mm4,mm3
movq        mm2,mm0
movq        mm5,mm3

pand        mm0,mask_r //r
pand        mm3,mask_r //r
pand        mm1,mask_g //g
pand        mm4,mask_g //g
pand        mm2,mask_b //b
pand        mm5,mask_b //b

psrlw      mm0,8      //r
psrlw      mm3,13   //r
psrlw      mm1,3      //g
psrlw      mm4,9      //g
psllw      mm2,3      //b
psrlw      mm5,2      //b

por        mm0,mm3 //r
por        mm1,mm4 //g
por        mm2,mm5 //b

movq        mm3,mm0 //r
movq        mm4,mm1 //g
movq        mm5,mm2 //b

pand        mm0,mask_1
pand        mm1,mask_1
pand        mm2,mask_1

pand        mm3,mask_2
pand        mm4,mask_2
pand        mm5,mask_2

psllq      mm0,8
psrlq      mm2,8

psrlq      mm4,8
psrlq      mm5,16

por        mm0,mm1
por        mm0,mm2
por        mm0,mm3
por        mm0,mm4
por        mm0,mm5

pand        mm0,mask_3
psrlq      mm0,16 // <- Line Needed to Maintain Color Channel Integrity

//--------------------
movq ,mm0 // destination <- mm0

//movq        ,mm7

//movq        mm7,mask_3
//psrlq      mm7,16

//movq ,mm7

add ebx,4 // ebx <- 2 16bit Pixels
add        edx,6          //move 2x24BPP
//--------------------
dec ecx // ecx <- ecx-1
jnz RGB565TO888_X
emms
}

}
Posted on 2005-09-28 08:48:00 by Esotic
If you are using movq, you are moving 64bit. If you are using movd, you are moving 32bit. So all 64 bit is copied to your memory since you are using movq.
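To make the "where do the bits go" part concrete: x86 is little-endian, so a movq store writes 8 consecutive bytes starting at the destination address, with the least significant byte of the register at the lowest address. A minimal C sketch of that layout follows; store_qword_le is a hypothetical helper that only imitates the store, it is not an MMX intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Imitates a 64-bit store on little-endian x86: 8 consecutive bytes are
   written starting at dst, least significant byte of value first.
   store_qword_le is a hypothetical name used only for illustration. */
static void store_qword_le(uint8_t *dst, uint64_t value)
{
    assert(dst != NULL);
    for (int i = 0; i < 8; ++i) {
        dst[i] = (uint8_t)(value >> (8 * i));
    }
}
```

With a destination address of 100, the bytes land at 100 through 107: byte 100 receives bits 0-7 of mm0 and byte 107 receives bits 56-63. Nothing below address 100 is touched.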

Can you give me more information on 16bit RGB 565 and 24bit RGB888? Maybe I could be of some help.
Posted on 2005-09-28 09:30:08 by roticv

I understand the difference between moving a Quadword and a Word, but what I'm wondering is where EXACTLY do those bits go into memory?

RGB565 is 16Bit color where there are 5 red bits, 6 green bits, and 5 blue bits RRRRRGGG GGGBBBBB, so 0xFFFF = white RGB565 Pixel

RGB888 is 24Bit color where there are 8 red bits, 8 green bits, and 8 blue bits RRRRRRRR GGGGGGGG BBBBBBBB, so 0xFFFFFF = white RGB888 Pixel
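The two layouts above translate into a few shifts per channel. The C sketch below expands one RGB565 pixel by copying the top bits of each channel into the newly created low bits, which appears to be what the paired psrlw/por steps in the MMX listing do. The function name and the R, G, B output order are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expands one RGB565 pixel (RRRRRGGG GGGBBBBB) to three 8-bit channels.
   The top bits of each channel are replicated into the low bits so that
   0x1F maps to 0xFF rather than 0xF8.
   Assumption: out[] is filled in R, G, B order; swap the stores for BGR. */
static void rgb565_to_rgb888(uint16_t p, uint8_t out[3])
{
    assert(out != NULL);
    uint8_t r5 = (uint8_t)((p >> 11) & 0x1F);
    uint8_t g6 = (uint8_t)((p >> 5) & 0x3F);
    uint8_t b5 = (uint8_t)(p & 0x1F);

    out[0] = (uint8_t)((r5 << 3) | (r5 >> 2));
    out[1] = (uint8_t)((g6 << 2) | (g6 >> 4));
    out[2] = (uint8_t)((b5 << 3) | (b5 >> 2));
}
```

Replication is what makes a white RGB565 pixel (0xFFFF) come out as a white RGB888 pixel (0xFFFFFF); a plain left shift would give 0xF8FCF8 instead.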

I was under the assumption that I could expand the two RGB565 pixels residing in the 4 high-order bytes of the MM register into two RGB888 pixels that would take up the 6 highest-order bytes, then write those bytes out to the destination address, which would get incremented by 6 each loop as I only really wanted the 6 high-order bytes.  But color channels kept getting obliterated, and as a test I started shifting the 6 bytes of RGB888 into the lowest-order bytes of the MM register.  Now my color channels are OK, but the image gets shifted to the left by a pixel or two.  I know there is something about how the MOVQ command is writing to memory that I don't understand, I'm just not sure what it is.  :-\

Thanks,

-Dave/Esotic
Posted on 2005-09-28 09:43:57 by Esotic
Not that I fully understand why this works, but I was able to "fix" the code so that it works without error (or so it would appear).  The trick is to start edx at 6 (just got lucky trying any dumb thing).  "mov        edx,0x06" 
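A possible explanation, sketched in C under the assumption that the store goes to edi+edx (the operand is missing from the listing): every movq writes 8 bytes, but the destination only advances by 6, so consecutive stores overlap by 2 bytes. If the 6 useful bytes sit in the low end of the register, the next store harmlessly overwrites the 2 junk bytes. If they sit in the high end, the 2 zero bytes at the bottom of each store wipe the last 2 bytes of the previous pair. The helper names and the 0xAA marker value are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a movq store: writes all 8 bytes of value,
   least significant byte at the lowest address. */
static void store_qword_le(uint8_t *dst, uint64_t value)
{
    for (int i = 0; i < 8; ++i) {
        dst[i] = (uint8_t)(value >> (8 * i));
    }
}

/* Writes `pairs` 6-byte payloads using 8-byte stores and a 6-byte stride.
   payload_shift = 0  : payload in the low 6 bytes of the qword
   payload_shift = 16 : payload in the high 6 bytes, low 2 bytes zero
   dst needs room for 6 * pairs + 2 bytes, because the final store still
   writes 8 bytes. */
static void write_pairs(uint8_t *dst, size_t pairs, unsigned payload_shift)
{
    assert(dst != NULL);
    assert(payload_shift == 0 || payload_shift == 16);
    for (size_t i = 0; i < pairs; ++i) {
        uint64_t payload = 0x0000AAAAAAAAAAAAULL; /* 6 marker bytes */
        store_qword_le(dst + 6 * i, payload << payload_shift);
    }
}
```

With payload_shift = 0 the buffer ends up as an unbroken run of marker bytes; with payload_shift = 16 every pair except the last loses its final 2 bytes to the zeros of the following store. Separately, mask_1 and mask_2 select bytes 6 and 4 of the loaded quadword, which are the third and fourth source pixels rather than the first two. If the load address is esi+ebx, each output pair is sampled two pixels ahead, which would account for the image sliding left by two pixels, and starting edx at 6 would push the output right by the same two pixels.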

Just in case anyone out there needs to convert 565to888.


//dz| do nothing for now
CQWORD mask_r= 0xf800f800f800f800;
CQWORD mask_g= 0x07e007e007e007e0;
CQWORD mask_b= 0x001f001f001f001f;
CQWORD mask_1= 0x00ff000000000000;
CQWORD mask_2= 0x000000ff00000000;
CQWORD mask_3= 0xffffffffffff0000;
CQWORD mask_4= 0xffffffffffffffff;
//CQWORD six = 0xffffffffffffffff;

int XLOOP = XRES / 2;

_asm{
mov esi,pCurrentSource //esi <- s_ptr
mov edi,pCurrentOutput //edi <- d_ptr
xor ebx,ebx //ebx <- 0 (ebx ^ ebx)
xor        edx,edx
mov        edx,0x06
mov ecx,XLOOP //ecx <- XRES / 2
RGB565TO888_X:
movq mm0, //mm0 <- source
movq mm3,
movq        mm1,mm0
movq        mm4,mm3
movq        mm2,mm0
movq        mm5,mm3

pand        mm0,mask_r //r
pand        mm3,mask_r //r
pand        mm1,mask_g //g
pand        mm4,mask_g //g
pand        mm2,mask_b //b
pand        mm5,mask_b //b

psrlw      mm0,8      //r
psrlw      mm3,13   //r
psrlw      mm1,3      //g
psrlw      mm4,9      //g
psllw      mm2,3      //b
psrlw      mm5,2      //b

por        mm0,mm3 //r
por        mm1,mm4 //g
por        mm2,mm5 //b

movq        mm3,mm0 //r
movq        mm4,mm1 //g
movq        mm5,mm2 //b

pand        mm0,mask_1
pand        mm1,mask_1
pand        mm2,mask_1

pand        mm3,mask_2
pand        mm4,mask_2
pand        mm5,mask_2

psllq      mm0,8
psrlq      mm2,8

psrlq      mm4,8
psrlq      mm5,16

por        mm0,mm1
por        mm0,mm2
por        mm0,mm3
por        mm0,mm4
por        mm0,mm5

//pand        mm0,mask_3
psrlq      mm0,16 // <- Line Needed to Maintain Color Channel Integrity

//--------------------
movq ,mm0 // destination <- mm0

//movq        ,mm7

//movq        mm7,mask_3
//psrlq      mm7,16

//movq ,mm7

add ebx,4 // ebx <- ebx+4 (two 16-bit source pixels)
add        edx,6          //move 2x24BPP
//--------------------
dec ecx // ecx <- ecx-1
jnz RGB565TO888_X
emms
}

}
Posted on 2005-09-28 11:32:38 by Esotic
I've also tried converting 565 to 888 with mmx, and the result is that mmx is useless here. It's possible to make an mmx converter, but it's ten times slower than using normal instructions.

;---[16->24]--------\
mov dx,
mov al,dl
shl al,3
mov ,al
shr dx,5
mov al,dl
shl al,2
mov ,al
shr dx,6
shl dl,3
mov ,dl
;-------------------/

Using ebx and ecx, you can convert a second pixel, in parallel (on newer AMD cpus).
Also, you could do a trick or two to increase write-bandwidth (because now writing a byte requires reading of 8bytes, internally to the cpu). That is, if the cpu's write-queue buffer isn't advanced.
1) use a stack-based array to write one scanline, then in the end do aligned-copy to the output frame
2) combine two pixels, then write 6 bytes
3) use mmx registers as temporary vars to hold 8 converted pixels, and then write 8-byte aligned data to framebuffer.
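Option 2 from the list can be sketched in C as follows: two pixels are converted in registers, packed into one 48-bit value, and emitted with a single 6-byte copy instead of six byte stores. This is only an illustration under assumptions: the helper names are hypothetical, the expansion is the truncating shift-only variant used in the scalar listing, and the B, G, R byte order is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Truncating expansion of one RGB565 pixel to 0x00RRGGBB
   (low bits of each channel are left at zero). */
static uint32_t expand565(uint16_t p)
{
    uint32_t r = (uint32_t)((p >> 11) & 0x1F) << 3;
    uint32_t g = (uint32_t)((p >> 5) & 0x3F) << 2;
    uint32_t b = (uint32_t)(p & 0x1F) << 3;
    return (r << 16) | (g << 8) | b;
}

/* Converts two pixels and writes exactly 6 bytes to dst.
   Assumption: bytes go out as B, G, R for each pixel. */
static void convert_pair(const uint16_t src[2], uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    uint64_t packed = (uint64_t)expand565(src[0])
                    | ((uint64_t)expand565(src[1]) << 24);
    uint8_t bytes[6];
    for (int i = 0; i < 6; ++i) {
        bytes[i] = (uint8_t)(packed >> (8 * i));
    }
    memcpy(dst, bytes, sizeof bytes); /* one 6-byte write, no overrun */
}
```

Unlike an 8-byte store with a 6-byte stride, this never touches the two bytes past the pair, so the destination buffer needs no padding.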
Posted on 2005-09-28 11:44:53 by Ultrano
Thank you, Ultrano, for the ASM lesson.

The code you posted is faster, but not by a factor of 10.  Your code gets 3 more FPS under heavy load, but I think that counts for something.

After removing some of the superfluous shifting in my MMX version (low order bits) I was hitting 18FPS, while your code was hitting 19FPS.  What I am wondering now is how to use mm6 and mm7 to store the other 2 pixels worth of data that is generated at the beginning of my loop and then discarded.  I'll be sure to post my findings.

You wouldn't happen to have a library of graphics related ASM code you'd like to share, would you?  :)

I would kinda expect that to be online somewhere.

Thanks,

-Dave/Esotic
Posted on 2005-09-28 12:31:13 by Esotic

ok, i officially give up trying to make that MMX any faster.  you win.

:)

-Dave/Esotic
Posted on 2005-09-28 12:50:40 by Esotic
Hmmm... now I'm trying to convert in the other direction and am having trouble


mov esi,pCurrentSource //esi <- s_ptr
mov edi,pCurrentOutput //edi <- d_ptr
mov ecx,XRES //ecx <- 720
RGB888TO565_X2:
mov dx,
mov ah,dh
and ah,0xF8
and dx,0x00FF
shl dx,5
or  ax,dx
mov dx,
and dx,0xFF00
shr dx,11
or  ax,dx
mov ,ax
add esi,3
add edi,2         
dec ecx
jnz RGB888TO565_X2


Posted on 2005-09-28 13:19:07 by Esotic

Here's the finalized working code



_asm{
mov esi,pCurrentSource //esi <- s_ptr
mov edi,pCurrentOutput //edi <- d_ptr
mov ecx,XRES //ecx <- XRES (one pixel per iteration)
RGB888TO565_X2:
mov al, //r
//mov al,dl
and al,0xF8
shl ax,8
mov dl, //g
and dx,0x00FC
shl dx,3
or  ax,dx
mov dl, //b
//and dl,0xF8
shr dl,3
or  ax,dx
mov ,ax
//shl al,3
//mov ,al
//shr dx,5
//mov al,dl
//shl al,2
//mov ,al
//shr dx,6
//shl dl,3
//mov ,dl
add esi,3 // esi <- esi+3 (one 24-bit source pixel)
add        edi,2          // edi <- edi+2 (one 16-bit output pixel)
//--------------------
dec ecx // ecx <- ecx-1
jnz RGB888TO565_X2
//emms
}
}
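
For comparison, the same packing can be written in C. This is a sketch of the mask-and-shift arithmetic visible in the listing (and 0xF8 / shl 8 for red, and 0xFC / shl 3 for green, shr 3 for blue); the function names are hypothetical, and the R, G, B source byte order is an assumption because the memory operands are missing from the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs three 8-bit channels into RRRRRGGG GGGBBBBB by dropping the low
   3 bits of red and blue and the low 2 bits of green. */
static uint16_t rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
}

/* Converts `pixels` packed 24-bit pixels to 16-bit pixels.
   Assumption: the source bytes are stored as R, G, B. */
static void convert_row_888_to_565(const uint8_t *src, uint16_t *dst,
                                   size_t pixels)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixels; ++i) {
        dst[i] = rgb888_to_rgb565(src[3 * i], src[3 * i + 1],
                                  src[3 * i + 2]);
    }
}
```

The source pointer advances 3 bytes per pixel and the destination 2 bytes, matching the add esi,3 / add edi,2 pair in the loop.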



I gotta learn to start using the debugger before posting my bug-ridden code.  :)
Posted on 2005-09-28 14:38:13 by Esotic
The third optimization trick I came up with (converting in blocks of 8 pixels, using mmx for write) got only a 10% speedup, doing one pixel in 12 cycles on my PC (AthlonXP 2000+, 400MHz DDR):

Hihi proc uses eax ebx ecx edx esi edi pDest,pSrc,numPix
mov esi,pSrc
mov edi,pDest

.data
mask0 dq  0FCF8F8FCF8F8FCF8h
mask1 dq  0F8F8FCF8F8FCF8F8h
mask2 dq  0F8FCF8F8FCF8F8FCh

.code


.while numPix>=8
mov edx,
mov eax,
mov ebx,
mov ecx,
;------[ 2 pixels ]--------\
mov ecx,edx

movzx eax,dx
shl dx,5
shl eax,8
shr ax,5
mov ah,dh

mov ebx,ecx
shl cx,5
shr ebx,8
shr bx,5
mov bh,ch

movd mm0,eax
movd mm1,ebx
;--------------------------/

;------[ 2 pixels ]--------\
mov edx,dword ptr
mov ecx,edx

movzx eax,dx
shl dx,5
shl eax,8
shr ax,5
mov ah,dh

mov ebx,ecx
shl cx,5
shr ebx,8
shr bx,5
mov bh,ch

movd mm2,eax
movd mm3,ebx
;--------------------------/

;------[ 2 pixels ]--------\
mov edx,dword ptr
mov ecx,edx

movzx eax,dx
shl dx,5
shl eax,8
shr ax,5
mov ah,dh

mov ebx,ecx
shl cx,5
shr ebx,8
shr bx,5
mov bh,ch

movd mm4,eax
movd mm5,ebx
;--------------------------/

;------[ 2 pixels ]--------\
mov edx,dword ptr
mov ecx,edx

movzx eax,dx
shl dx,5
shl eax,8
shr ax,5
mov ah,dh

mov ebx,ecx
shl cx,5
shr ebx,8
shr bx,5
mov bh,ch

movd mm6,eax
movd mm7,ebx
;--------------------------/

; data is
;3(mm0)+3(mm1)+2(mm2)
;1(mm2)+3(mm3)+3(mm4)+1(mm5)
;2(mm5)+3(mm6)+3(mm7)

psllq mm1,24
por mm0,mm1
movq mm1,mm2
psllq mm1,48
por mm0,mm1

psrlq mm2,16
psllq mm3,8
psllq mm4,32
por mm2,mm3
por mm2,mm4
movq mm1,mm5
psllq mm1,56
por mm2,mm1

psrlq mm5,8
psllq mm6,16
psllq mm7,40
por mm5,mm6
por mm5,mm7


pand mm0,mask0  ; you can remove these, for 1 cycle speedup
pand mm2,mask1 ; tradeoff is that red+=green/32 (max error=7/256)
pand mm5,mask2

movq qword ptr,mm0
movq qword ptr,mm2
movq qword ptr,mm5




add esi,16
add edi,24
sub numPix,8
.endw

.while numPix
;---[16->24]--------\
mov dx,
mov al,dl
shl al,3
mov ,al
shr dx,5
mov al,dl
shl al,2
mov ,al
shr dx,6
shl dl,3
mov ,dl
add esi,2
add edi,3
;-------------------/
dec numPix
.endw




ret
Hihi endp


If we remove the mask, we save another cycle per pixel, but the red channel will have 2.7% extra brightness, depending on green channel. Try improving the mmx instructions' overlapping to save 1-2 cycles per 8-pix block on Intel cpus.
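The 2.7% figure can be checked with a small C sketch. It assumes, from reading the shift sequence in the listing, that the unmasked red byte is simply the high byte of the RGB565 word (RRRRRGGG), so the top 3 green bits leak into the low bits of red unless the 0xF8 mask clears them. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Red byte with the pand mask applied: only the 5 red bits survive. */
static uint8_t red_masked(uint16_t p)
{
    return (uint8_t)((p >> 8) & 0xF8);
}

/* Red byte without the mask: the top 3 green bits remain in bits 0-2. */
static uint8_t red_unmasked(uint16_t p)
{
    return (uint8_t)(p >> 8);
}

/* Scans every possible RGB565 value and returns the largest difference
   between the unmasked and masked red bytes. */
static unsigned max_red_error(void)
{
    unsigned worst = 0;
    for (uint32_t p = 0; p <= 0xFFFF; ++p) {
        uint8_t raw = red_unmasked((uint16_t)p);
        uint8_t clean = red_masked((uint16_t)p);
        assert(raw >= clean); /* masking can only clear bits */
        unsigned err = (unsigned)(raw - clean);
        if (err > worst) {
            worst = err;
        }
    }
    return worst;
}
```

The worst case is 7, and 7/256 is about 2.7%, which matches the "max error=7/256" comment next to the pand instructions.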
I tried pipelining the regular instructions, but it didn't give any extra speed, so I reverted to the readable sequence of the code (2-pixel blocks' conversion)
Posted on 2005-09-28 19:05:57 by Ultrano
On x86, I have almost no code for graphics. But for my PalmOS games, I've made an arsenal of procs - in C and ARM9 asm. Not procedural, just texture/sprite-based.


edit: :shock: I've made a mistake in the benchmark results - it turned out the optimization is actually 26%, instead of 10% , and each pixel takes 9 cycles to convert  ^^' .
Posted on 2005-09-28 19:12:29 by Ultrano