Hi.

I am trying to make an RGB24 fill routine. I implemented an aligned 32-bit version, which fills the bitmap in 12-byte steps (96 bits is divisible by both 24 and 32).

It works by filling up registers like so:

        xor    edx,edx
        xor    ecx,ecx
        xor    eax,eax

    repeat 4
        shld    edx,ecx,24
        shld    ecx,eax,24
        shl    eax,24
        or      eax,  ; RGB triplet value (0x00RRGGBB)
    end repeat


The result of the above procedure is:


  edx    ecx    eax
  RGBR  GBRG  BRGB


After which I can simply store eax, ecx, and edx to the bitmap. The left-over bytes (the tail that is shorter than 12 bytes = 96 bits) can be filled manually with a single-byte loop (kinda like a traditional memcpy with both rep movsd and rep movsb).
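To make the register layout easy to check, here is a rough C model of the 32-bit routine. It is only a sketch: the function names are made up for illustration, the dword stores are modelled with memcpy, and the alignment handling of the real routine is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build the three dwords the way the shld/shl/or loop does.
   rgb is the triplet as 0x00RRGGBB.
   out[0] models eax, out[1] models ecx, out[2] models edx. */
static void rgb24_pattern32(uint32_t rgb, uint32_t out[3])
{
    uint32_t eax = 0, ecx = 0, edx = 0;
    rgb &= 0x00FFFFFFu;
    for (int i = 0; i < 4; i++) {
        edx = (edx << 24) | (ecx >> 8);   /* shld edx,ecx,24 */
        ecx = (ecx << 24) | (eax >> 8);   /* shld ecx,eax,24 */
        eax = (eax << 24) | rgb;          /* shl eax,24 / or eax,rgb */
    }
    out[0] = eax;
    out[1] = ecx;
    out[2] = edx;
}

/* Fill nbytes at dst: 12-byte steps first, then a single-byte tail loop. */
static void rgb24_fill32(uint8_t *dst, size_t nbytes, uint32_t rgb)
{
    uint32_t pat[3];
    uint8_t bytes[12];
    size_t off = 0;

    assert(dst != NULL || nbytes == 0);
    rgb24_pattern32(rgb, pat);

    /* Serialise eax, ecx, edx in little-endian order, as x86 stores would. */
    for (int i = 0; i < 12; i++)
        bytes[i] = (uint8_t)(pat[i / 4] >> (8 * (i % 4)));

    for (; off + 12 <= nbytes; off += 12)   /* main 96-bit steps */
        memcpy(dst + off, bytes, 12);
    for (; off < nbytes; off++)             /* left-over bytes */
        dst[off] = bytes[off % 12];
}
```

With rgb = 0x112233 the model gives eax = 0x33112233, ecx = 0x22331122 and edx = 0x11223311, which is the BRGB / GBRG / RGBR layout from the diagram.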

Now, I want to implement the same using MMX and SSE.
For MMX, we will have to use three 64-bit registers: MM2 as edx, MM1 as ecx, and MM0 as eax. This time the loop will fill the bitmap in 192-bit steps (lowest multiple of 64 that is also divisible by 24).
The registers should look like this:


  MM2        MM1        MM0
RGBRGBRG  BRGBRGBR  GBRGBRGB


I am having trouble filling up these registers. In the above 32-bit version, I used a double-precision shift to continuously feed a new RGB triplet through the set of registers. However, no double-precision shift exists for MMX. I could go for a very ugly move/shift sequence of instructions, where I set each register individually. However, I would like to see someone come up with something straightforward and as algorithmic as the double-precision shift feeder of the 32-bit version.

Leave the SSE version for later, as it is just another extension of the same idea but with quadruple the magnitude.
Posted on 2005-07-21 07:54:11 by comrade
Interesting,
I'm not really familiar with RGB filling, but your snippets show you are distributing the same 24 bits (3 bytes) over three 32-bit registers?

Doing this with MMX or XMM registers would take a lot of messy code.
Well, with good use of the PINSRW (packed insert word) opcode it would be possible.
Ok holds the 24bits [17h] 00rrggbb [14h]

mov al, byte
mov ah, byte ;AX = RG
mov cl, byte
mov ch, byte ;CX = BR
mov dl, byte
mov dh, byte ;DX = GB
;spam xmm0
PINSRW xmm0,cx,0
PINSRW xmm0,ax,1
PINSRW xmm0,dx,2
PINSRW xmm0,cx,3
PINSRW xmm0,ax,4
PINSRW xmm0,dx,5
PINSRW xmm0,cx,6
PINSRW xmm0,ax,7
;spam xmm1
...
;spam xmm2
...

You could use the same method with
MOVD and then PSHUFD, but that would be even more opcodes than the above garbage

xmm0
bytes: 01 23 45 67 01 23 45 67
rgb24: RG BR GB RG BR GB RG BR
xmm1
bytes: 01 23 45 67 01 23 45 67
rgb24: GB RG BR GB RG BR GB RG
xmm2
bytes: 01 23 45 67 01 23 45 67
rgb24: BR GB RG BR GB RG BR GB
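Which 16-bit value belongs in each PINSRW slot can be computed instead of typed out by hand. The C sketch below does that for a 48-byte pattern spread over three 16-byte registers. It assumes the bytes sit in memory in B, G, R order (what a 0x00RRGGBB dword gives on little-endian x86); if the bitmap uses R, G, B order, the letters swap accordingly. The helper name is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the 16-bit value for word slot idx (0..7) of the
   reg-th 16-byte register (0..2) of a 48-byte repeating RGB24 pattern.
   Assumes memory order B, G, R, i.e. rgb = 0x00RRGGBB stored little-endian. */
static uint16_t rgb24_pinsrw_word(uint32_t rgb, unsigned reg, unsigned idx)
{
    assert(reg < 3 && idx < 8);

    unsigned p = 16u * reg + 2u * idx;                    /* byte offset in the stream */
    uint8_t lo = (uint8_t)(rgb >> (8 * (p % 3)));         /* byte at offset p     */
    uint8_t hi = (uint8_t)(rgb >> (8 * ((p + 1) % 3)));   /* byte at offset p + 1 */

    return (uint16_t)(lo | (hi << 8));
}
```

Only three distinct word values ever come out, so three general-purpose registers are enough to feed all 24 inserts, as in the code above.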

My guess is the non SSE version might be a little faster.
Posted on 2005-07-25 00:19:09 by r22
Why not have 3 qwords on the stack, fill in the appropriate R/G/B byte values at the memory locations, and then just load the 3 mm registers from that memory?

rgbMMXfiller struct
x0 QWORD ?
x1 QWORD ?
x2 QWORD ?
reserved dd ? ; against garbage on stack
rgbMMXfiller ends

local rf:rgbMMXfiller
Clear rf ; zeroes-out the structure
mov eax,RGBcolor
lea edx,rf
or ,eax
or ,eax
...
or ,eax ; aww, I just got up, so I messed up this value 3 times, on last edit it is correct ;)
movq mm0,rf.x0
movq mm1,rf.x1
movq mm2,rf.x2

; and now fill with these registers
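The same build-in-memory idea, modelled in C as a rough sketch. Note the swap: it writes the 24 pattern bytes one at a time instead of OR-ing the colour dword into a zeroed structure, so no padding field is needed. The function name is hypothetical, and x0/x1/x2 stand in for the three mm registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build a 24-byte pattern in memory, load it as three qwords,
   then fill in 24-byte (8-pixel) steps with a byte loop for the tail.
   rgb is 0x00RRGGBB; bytes go out in B, G, R order. */
static void rgb24_fill_via_buffer(uint8_t *dst, size_t nbytes, uint32_t rgb)
{
    uint8_t pat[24];
    uint64_t x0, x1, x2;          /* stand-ins for mm0, mm1, mm2 */
    size_t off = 0;

    assert(dst != NULL || nbytes == 0);

    for (int i = 0; i < 24; i++)
        pat[i] = (uint8_t)(rgb >> (8 * (i % 3)));

    memcpy(&x0, pat, 8);          /* movq mm0, first qword  */
    memcpy(&x1, pat + 8, 8);      /* movq mm1, second qword */
    memcpy(&x2, pat + 16, 8);     /* movq mm2, third qword  */

    for (; off + 24 <= nbytes; off += 24) {
        memcpy(dst + off, &x0, 8);
        memcpy(dst + off + 8, &x1, 8);
        memcpy(dst + off + 16, &x2, 8);
    }
    for (; off < nbytes; off++)   /* left-over bytes */
        dst[off] = pat[off % 24];
}
```

Doubling the buffer to 48 bytes and loading six qwords gives the 16-pixel variant.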

You can also expand the code to use 6 mm registers, for 16-pixel fills.
Posted on 2005-07-25 08:48:30 by Ultrano