Hi.
I am trying to make an RGB24 fill routine. I implemented an aligned 32-bit version, which fills in 12-byte steps (96 bits is divisible by both 24 and 32).
It works by filling up registers like so:
xor edx,edx
xor ecx,ecx
xor eax,eax
repeat 4
shld edx,ecx,24
shld ecx,eax,24
shl eax,24
or eax, ; RGB triplet value (0x00RRGGBB)
end repeat
The result of the above procedure is:
edx ecx eax
RGBR GBRG BRGB
After that I can simply store eax, ecx, and edx to the three dwords of each 12-byte block. The leftover bytes (fewer than 12 bytes = 96 bits) can be filled with a single-byte loop (much like a traditional memcpy that pairs rep movsd with rep movsb).
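For anyone who wants to check the feeder without stepping through it in a debugger, here is a minimal C sketch of the same idea. The names `feed_rgb32`, `put32le`, and `fill_rgb24` are made up for illustration, and the store order (eax at the lowest address, little-endian, which puts the blue byte first in memory) is an assumption rather than something the post spells out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates the repeat-4 shld/shld/shl/or feeder on the chain edx:ecx:eax.
   out[0] = edx, out[1] = ecx, out[2] = eax. */
void feed_rgb32(uint32_t rgb, uint32_t out[3])
{
    uint32_t edx = 0, ecx = 0, eax = 0;
    rgb &= 0x00FFFFFFu;                      /* 0x00RRGGBB */
    for (int i = 0; i < 4; i++) {
        edx = (edx << 24) | (ecx >> 8);      /* shld edx,ecx,24 */
        ecx = (ecx << 24) | (eax >> 8);      /* shld ecx,eax,24 */
        eax = (eax << 24) | rgb;             /* shl eax,24 / or eax,rgb */
    }
    out[0] = edx;
    out[1] = ecx;
    out[2] = eax;
}

/* Little-endian dword store, standing in for `mov [mem],reg` on x86. */
void put32le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Fills nbytes of dst: 12-byte steps first, then a byte loop for the tail. */
void fill_rgb24(uint8_t *dst, size_t nbytes, uint32_t rgb)
{
    uint32_t r[3];
    size_t i = 0;

    assert(dst != NULL || nbytes == 0);
    feed_rgb32(rgb, r);
    for (; i + 12 <= nbytes; i += 12) {
        put32le(dst + i,     r[2]);          /* eax (assumed lowest address) */
        put32le(dst + i + 4, r[1]);          /* ecx */
        put32le(dst + i + 8, r[0]);          /* edx */
    }
    for (; i < nbytes; i++)                  /* tail: fewer than 12 bytes */
        dst[i] = (uint8_t)(rgb >> (8 * (i % 3)));
}
```

Calling `fill_rgb24(buf, 30, 0x112233)` writes two full 12-byte blocks and then six tail bytes one at a time.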
Now, I want to implement the same using MMX and SSE.
For MMX, we will have to use three 64-bit registers: MM2 as edx, MM1 as ecx, and MM0 as eax. This time the loop will fill the bitmap in 192-bit steps (lowest multiple of 64 that is also divisible by 24).
The registers should look like this:
MM2 MM1 MM0
RGBRGBRG BRGBRGBR GBRGBRGB
I am having trouble filling up these registers. In the 32-bit version above, I used a double-precision shift to continuously feed a new RGB triplet through the set of registers. However, MMX has no double-precision shift. I could go for a very ugly move/shift sequence of instructions that sets each register individually, but I would like to see someone come up with something as straightforward and algorithmic as the double-precision shift feeder of the 32-bit version.
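One possible way to keep the feeder algorithmic without a double-precision shift is to build it from plain shifts and an OR: shift the upper register left by 24, take the top 24 bits of the lower register by shifting a copy right by 40, and merge the two. The C sketch below models that on 64-bit integers. On MMX each line of the loop would presumably become a movq/psllq/psrlq/por group, which I have not assembled or timed. `feed_rgb64` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feeds the triplet eight times through the 192-bit chain mm2:mm1:mm0.
   m[0] = mm0, m[1] = mm1, m[2] = mm2. */
void feed_rgb64(uint32_t rgb, uint64_t m[3])
{
    uint64_t mm2 = 0, mm1 = 0, mm0 = 0;
    uint64_t t = rgb & 0x00FFFFFFu;          /* 0x00RRGGBB */

    assert(m != NULL);
    for (int i = 0; i < 8; i++) {            /* 8 x 24 bits = 192 bits */
        mm2 = (mm2 << 24) | (mm1 >> 40);     /* psllq 24, psrlq 40 on a copy, por */
        mm1 = (mm1 << 24) | (mm0 >> 40);
        mm0 = (mm0 << 24) | t;
    }
    m[0] = mm0;
    m[1] = mm1;
    m[2] = mm2;
}
```

With `rgb = 0x112233` this gives mm2 = 0x1122331122331122, mm1 = 0x3311223311223311, and mm0 = 0x2233112233112233, which is the RGBRGBRG / BRGBRGBR / GBRGBRGB layout from the post.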
Leave the SSE version for later, as it is just another extension of the same idea but with quadruple the magnitude.
Interesting,
I'm not really familiar with RGB filling, but your snippets show you are distributing the same 24 bits (3 bytes) over three 32-bit registers?
Doing this with MMX or XMM registers would take a lot of messy code.
Well, with good use of the PINSRW (packed insert word) opcode it would be possible.
Ok holds the 24bits [17h] 00rrggbb [14h]
mov al, byte
mov ah, byte ;AX = RG
mov cl, byte
mov ch, byte ;CX = BR
mov dl, byte
mov dh, byte ;DX = GB
;spam xmm0
PINSRW xmm0,eax,0 ;RG (only the low word of the register is inserted)
PINSRW xmm0,ecx,1 ;BR
PINSRW xmm0,edx,2 ;GB
PINSRW xmm0,eax,3
PINSRW xmm0,ecx,4
PINSRW xmm0,edx,5
PINSRW xmm0,eax,6
PINSRW xmm0,ecx,7
;spam xmm1
...
;spam xmm2
...
You could use the same method with MOVD and then PSHUFD, but that would be even more opcodes than the above garbage.
xmm0
bytes: 01 23 45 67 01 23 45 67
rgb24: RG BR GB RG BR GB RG BR
xmm1
bytes: 01 23 45 67 01 23 45 67
rgb24: GB RG BR GB RG BR GB RG
xmm2
bytes: 01 23 45 67 01 23 45 67
rgb24: BR GB RG BR GB RG BR GB
My guess is the non SSE version might be a little faster.
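The lane pattern in those three diagrams follows a simple rule, sketched here in plain C rather than with real PINSRW instructions: lane k of register j takes word (8*j + k) mod 3 of the cycle RG, BR, GB. The function name `build_xmm_blocks` is hypothetical, and the sketch assumes R is the first byte in memory, as the diagrams label it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Builds 48 bytes (three XMM-sized blocks) from three 16-bit words.
   out[0..15] models xmm0, out[16..31] xmm1, out[32..47] xmm2. */
void build_xmm_blocks(uint8_t r, uint8_t g, uint8_t b, uint8_t out[48])
{
    /* The three words as byte pairs in memory order: RG, BR, GB. */
    const uint8_t words[3][2] = { { r, g }, { b, r }, { g, b } };

    assert(out != NULL);
    for (int w = 0; w < 24; w++) {           /* 3 registers x 8 word lanes */
        out[2 * w]     = words[w % 3][0];
        out[2 * w + 1] = words[w % 3][1];
    }
}
```

Because 8 mod 3 is 2, each register starts two words further along the cycle than the previous one: xmm0 starts with RG, xmm1 with GB, and xmm2 with BR.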
Why not have 3 qwords on the stack, fill in the appropriate R/G/B byte values at the memory locations, and then just load the 3 mm registers from that memory?
rgbMMXfiller struct
x0 QWORD ?
x1 QWORD ?
x2 QWORD ?
reserved dd ? ; against garbage on stack
rgbMMXfiller ends
local rf:rgbMMXfiller
Clear rf ; zeroes-out the structure
mov eax,RGBcolor
lea edx,rf
or ,eax
or ,eax
...
or ,eax ; aww, I just got up, so I messed up this value 3 times, on last edit it is correct ;)
movq mm0,rf.x0
movq mm1,rf.x1
movq mm2,rf.x2
; and now fill with these registers
You can also expand the code to use 6 mm registers, for 16-pixel fills
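Here is a C model of the build-in-memory approach, in case it helps to see why the spare dword is there. It assumes the `or` destinations sit at byte offsets 0, 3, 6, ... from the start of the structure, so the last 4-byte OR spills one (zero) byte past the final qword. That is my reading, not something the post states. `or32le` and `build_pattern` are hypothetical names, and `memcpy` stands in for the movq loads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ORs a little-endian dword into memory, like `or [mem],eax` on x86. */
void or32le(uint8_t *p, uint32_t v)
{
    p[0] |= (uint8_t)v;
    p[1] |= (uint8_t)(v >> 8);
    p[2] |= (uint8_t)(v >> 16);
    p[3] |= (uint8_t)(v >> 24);
}

/* Builds nq qwords of fill pattern in out: nq = 3 covers 8 pixels,
   nq = 6 covers 16 pixels. out must hold nq * 8 bytes. */
void build_pattern(uint32_t rgb, uint8_t *out, size_t nq)
{
    uint8_t tmp[6 * 8 + 4] = { 0 };          /* zeroed, plus a spare dword */
    size_t nbytes = nq * 8;

    assert(out != NULL);
    assert(nq == 3 || nq == 6);
    rgb &= 0x00FFFFFFu;                      /* top byte must be zero */
    for (size_t off = 0; off < nbytes; off += 3)
        or32le(tmp + off, rgb);              /* last OR touches the spare dword */
    memcpy(out, tmp, nbytes);                /* stand-in for the movq loads */
}
```

The zero top byte of 0x00RRGGBB is what makes the overlapping ORs safe: each one leaves the fourth byte untouched, and the next OR three bytes later fills it in.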