What does "Only lose 4 bits - instead of 6" mean?
This means that one extra bit of each source is preserved, but the difference would only be noticeable in certain situations, such as performing several blend operations on a primary-colored gradient. Remember we are talking about 24-bit color here (about 16.7 million colors), and the two extra preserved bits are the least significant ones.
Posted on 2002-05-05 13:58:32 by bitRAKE
The purpose of the shifting is to allow the signed multiply to use all the bits of the alpha as unsigned data - each resulting word of the alpha registers (registers MM4/5 in the algo above) is transformed into ALPHA*128. This leaves the top bit clear - preserving the sign of (DEST - SOURCE) for the signed addition.
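To check values by hand, here is a scalar C sketch of what a single 16-bit lane goes through in the routine below. The helper names (`floor_div`, `alpha_word`, `blend_lane`) are made up for illustration, and the floor division stands in for taking the high word of the signed product. Treat it as a model of the arithmetic under those assumptions, not as a replacement for the MMX code.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division: negative products round toward minus infinity,
   the same way keeping the high word of a signed product does. */
static int32_t floor_div(int32_t a, int32_t b) {
    int32_t q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0))) q--;
    return q;
}

/* The word replicated into mm4: the alpha byte and the byte below it,
   shifted right once, i.e. alpha*128 + low/2. Bit 15 is always clear. */
static int32_t alpha_word(uint8_t alpha, uint8_t low) {
    return (((int32_t)alpha << 8) | low) >> 1;
}

/* One lane: psubsw, psllw 1, pmulhw, paddsw, then the clamp of packuswb. */
static uint8_t blend_lane(uint8_t pixel, uint8_t color, int32_t aword) {
    assert(aword >= 0 && aword <= 0x7FFF);
    int32_t diff = (int32_t)pixel - color;               /* -255..255 */
    int32_t doubled = diff * 2;                          /* still fits a signed word */
    int32_t scaled = floor_div(doubled * aword, 65536);  /* high word of the product */
    int32_t sum = color + scaled;
    if (sum < 0) sum = 0;
    if (sum > 255) sum = 255;
    return (uint8_t)sum;
}
```

With alpha 80h and a zero low byte the lane lands halfway between the two inputs: `blend_lane(200, 100, alpha_word(0x80, 0))` gives 150, and so does `blend_lane(100, 200, alpha_word(0x80, 0))`.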

This code is untested, and you can easily unroll it to do four pixels per loop iteration for more speed. :)
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD

mov eax,buff
mov ecx,len ; number of pixels
shr ecx,1
dec ecx

pxor mm7,mm7

movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB
; punpckldq mm6,mm6 ; ARGBARGB ; no need for this! ;)

psrlw mm4,1 ; ....VW..
punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
; mm4 = {each word alpha*128}
; movq mm3,mm6 ; code fat be gone... ;)
; punpcklbw mm1,mm7 ; no need for this :)
punpcklbw mm6,mm7 ; low bytes! :P

; mm6 = unpacked color
; mm7 = 0
@@:
movq mm0,[eax+ecx*8]
dec ecx
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm6
psubsw mm2,mm6
psllw mm0,1
psllw mm2,1
pmulhw mm0,mm4
pmulhw mm2,mm4
paddsw mm0,mm6
paddsw mm2,mm6
packuswb mm0,mm2
movq [eax+ecx*8],mm0
jnz @B

ret
SolidAlphaBlend ENDP
The explanation is hard to give in a few words - it would be better to read the other thread where the algo was developed (HERE). You'll also see Eoin's work on the algo, and he presents an alternative for non-MMX CPUs at the top.
Posted on 2002-05-05 14:08:36 by bitRAKE
Thank you for your willingness to help. I think I now understand, and will walk through what I think is happening; correct me if this is wrong (but I do get a solution, so I think I got it):
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD

mov eax,buff
mov ecx,len ; number of pixels
shr ecx,1
dec ecx

pxor mm7,mm7

movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB
punpckldq mm6,mm6 ; ARGBARGB

psrlw mm4,1 ; ....VW..
Shift each word right by one (divide by two) to get effectively Alpha*256/2 + Blue/2, or the 128*Alpha you're getting at.


punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
; mm4 = {each word alpha*128}

This *is* 128 times the A char + Blue/2, now copied into all four word locations.


movq mm3,mm6
[b] punpcklbw mm1,mm7[/b] ;??????
punpckhbw mm3,mm7

; mm3 = unpacked color
; mm7 = 0

mm1 is undefined to start with, but the high byte of each of its words is now zeroed. mm3 is the same, but now holds .A.B.G.R, ok.


@@:
movq mm0,[eax+ecx*8]
dec ecx
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm3
psubsw mm2,mm3

Ok, 8 bytes (two pixels) are read and unpacked into words across two MMX registers. Then the difference between each unpacked value and the set blend 'rgb' color is found: (A-a),(B-b),(G-g),(R-r) for the two pixels.


psllw mm0,1
psllw mm2,1

Now multiply each unpacked difference word by 2 for both pixels. This sets up for the upcoming code.


pmulhw mm0,mm4
pmulhw mm2,mm4

Ok, now each unpacked word component (ARGB) of each pixel is multiplied:
[ 2*(R-r) ] * [ 128*Alpha + Blue/2 ] == [ (R-r)*Alpha*256 + (R-r)*Blue ].

This instruction keeps only the upper word of the dword result. That almost totally ignores the (R-r)*Blue term: its magnitude is below 2^8 * 2^8 = 2^16, so it sits in the low word, which is dropped (at most it nudges the result by one).

Also, if alpha were 256 the result would be just (R-r); at the real maximum of 255 it is (R-r)*255/256. If alpha is 0, it is 0. At alpha = 128 it is (R-r)/2. This is applied evenly to all components of both pixels. As I see now, this is how a percentage is found, with a resolution of 1/256 per alpha step over 0->255.
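A brute-force check of that reasoning, as a C sketch: it compares the high word of 2*(R-r) * (128*Alpha + low/2) with the plain floor((R-r)*Alpha/256) over every difference, alpha, and low-byte value. `floor_div` and `max_low_byte_error` are names invented here, and `low` stands for whichever color byte shares the word with alpha.

```c
#include <stdint.h>
#include <stdlib.h>

/* Floor division, matching how the high word of a signed product rounds. */
static int32_t floor_div(int32_t a, int32_t b) {
    int32_t q = a / b;
    if ((a % b != 0) && ((a < 0) != (b < 0))) q--;
    return q;
}

/* Largest gap between the pmulhw-style result and the ideal
   floor(diff*alpha/256), taken over all possible inputs. */
static int max_low_byte_error(void) {
    int worst = 0;
    for (int32_t diff = -255; diff <= 255; diff++) {
        for (int32_t alpha = 0; alpha <= 255; alpha++) {
            for (int32_t low = 0; low <= 255; low++) {
                int32_t word = ((alpha << 8) | low) >> 1;   /* alpha*128 + low/2 */
                int32_t got = floor_div(2 * diff * word, 65536);
                int32_t ideal = floor_div(diff * alpha, 256);
                int err = abs((int)(got - ideal));
                if (err > worst) worst = err;
            }
        }
    }
    return worst;
}
```

The function returns 1, so the stray low byte never moves a result by more than one step. It also shows the ceiling at full alpha: a difference of 255 with the word at 7FFFh comes out as 254, not 255.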


paddsw mm0,mm3
paddsw mm2,mm3

Now the scaled difference, Alpha%(R-r), is added back to the blend color 'r'. This effectively adds a percentage of the difference between the two colors, as in the equation I stated earlier: D + Alpha%(S-D) for alpha blend.


packuswb mm0,mm2
movq [eax+ecx*8],mm0
jnz @B
ret
SolidAlphaBlend ENDP

Repack the two pixels A'B'G'R'A'B'G'R', and save them. Then loop onto the next two pixels.

Thanx, I think I got it.... Umm, I don't think the line punpcklbw mm1,mm7 is used or needed, though?

Thanx again bitRake and Eoin!
:alright:
NaN
Posted on 2002-05-05 17:59:03 by NaN
Posted on 2002-05-05 18:03:20 by NaN
Sorry, I do that a lot because I don't like making a thousand posts. Also, note that after reading your post I made three corrections to the code. You seem to have a good grasp of it - way to go! Looking at the registers with some test data in OllyDbg is the best way, imho.
Posted on 2002-05-05 18:09:39 by bitRAKE
Never used Ollydbg. Will have to check it out.


Errors that crashed my machine *again* gave me reason to get OllyDbg sooner than I thought. :) I like the user interface, but wish you could close a file without exiting, so I can recompile :rolleyes: .

Anyway, with its help I saw an error we both overlooked. Well, actually two, from the same source: the way the memory is being read from and saved to. You're stepping backwards through memory (which is OK), but you start 8 bytes beyond the bitmap boundary and finish 8 bytes too soon when you dec/jnz in a loop. The more serious problem was that the source bytes are not 1:1 with the destination bytes, since ECX was being decremented between the MMX load and its store (this is what crashed the machine ~ hard ;) )
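The mismatch is easier to see with the loop counter written out. This C sketch uses plain array indexes in place of `[eax+ecx*8]` and hypothetical function names; each function marks which quadwords get stored and counts the iterations that read one quadword but write a different one. It assumes at least two quadwords, since the first variant never reaches zero otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Decrement sits between the load and the store, ending on jnz. */
static size_t run_dec_between(long qwords, unsigned char *stored) {
    assert(qwords >= 2);
    size_t mismatches = 0;
    long i = qwords - 1;
    for (;;) {
        long load_index = i;
        i--;                          /* dec ecx */
        long store_index = i;         /* store now targets the previous quadword */
        stored[store_index] = 1;
        if (load_index != store_index) mismatches++;
        if (i == 0) break;            /* jnz falls through at zero */
    }
    return mismatches;
}

/* Decrement after the store: test first, loop while the index is not negative. */
static size_t run_dec_after(long qwords, unsigned char *stored) {
    size_t mismatches = 0;
    for (long i = qwords; --i >= 0; ) {
        long load_index = i;
        long store_index = i;         /* same quadword is read and written */
        stored[store_index] = 1;
        if (load_index != store_index) mismatches++;
    }
    return mismatches;
}
```

For four quadwords the first variant mismatches on all three of its iterations and never writes the last quadword, while the second visits indexes 3 down to 0 with load and store in step.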

So here is my fix to your source; it works well now.
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD

mov eax,buff
mov ecx,len ; number of pixels
shr ecx,1
dec ecx

pxor mm7,mm7

movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB

psrlw mm4,1 ; ....VW..
punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
punpcklbw mm6,mm7 ; low bytes! :P

xor edx, edx
.while (edx < ecx )
movq mm0,[eax+edx*8]
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm6
psubsw mm2,mm6
psllw mm0,1
psllw mm2,1
pmulhw mm0,mm4
pmulhw mm2,mm4
paddsw mm0,mm6
paddsw mm2,mm6
packuswb mm0,mm2
movq [eax+edx*8],mm0
inc edx
.endw

ret
SolidAlphaBlend ENDP


Thanx again bitRAKE for all your help!
:alright:
NaN
Posted on 2002-05-05 20:01:58 by NaN
Here is the fixed loop.
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD

mov eax,buff
mov ecx,len ; number of bytes
shr ecx,3 ; bytes / 8 = loop count (2 pixels x 4 bytes/pixel per pass)

pxor mm7,mm7

movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB
punpcklbw mm6,mm7

psrlw mm4,1 ; ....VW..
punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
jmp _1

ALIGN 8

_0: movq mm0,[eax+ecx*8]
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm6
psubsw mm2,mm6
psllw mm0,1
psllw mm2,1
pmulhw mm0,mm4
pmulhw mm2,mm4
paddsw mm0,mm6
paddsw mm2,mm6
packuswb mm0,mm2
movq [eax+ecx*8],mm0
_1: dec ecx
jns _0

ret
SolidAlphaBlend ENDP
Unroll and interleave two loops (using two more MMX registers), then unroll again for 8 pixels in one loop, and add a prefetch instruction, then you're really cooking...
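In C terms, unrolling and interleaving looks roughly like the sketch below: each pass handles four pixels as two independent pairs, so the work on one pair does not have to wait for the other. `blend_px`, `blend_simple`, and `blend_unrolled` are illustrative names, the per-channel formula is the plain c + (p - c)*a/256 rather than the MMX sequence, and the pixel count is assumed to be a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Blend one 32-bit pixel toward `color`; the top byte of `color` is alpha.
   All four bytes are blended, as the MMX lanes do. */
static uint32_t blend_px(uint32_t px, uint32_t color) {
    int32_t a = (int32_t)(color >> 24);
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        int32_t p = (int32_t)((px >> shift) & 0xFF);
        int32_t c = (int32_t)((color >> shift) & 0xFF);
        int32_t v = c + ((p - c) * a) / 256;   /* truncating division */
        out |= (uint32_t)(v & 0xFF) << shift;
    }
    return out;
}

/* Reference: one pixel per pass. */
static void blend_simple(uint32_t *px, size_t n, uint32_t color) {
    for (size_t i = 0; i < n; i++) px[i] = blend_px(px[i], color);
}

/* Four pixels per pass, loads grouped ahead of the work, stores after it. */
static void blend_unrolled(uint32_t *px, size_t n, uint32_t color) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t a0 = px[i],     a1 = px[i + 1];   /* first pair */
        uint32_t b0 = px[i + 2], b1 = px[i + 3];   /* second pair */
        a0 = blend_px(a0, color); b0 = blend_px(b0, color);
        a1 = blend_px(a1, color); b1 = blend_px(b1, color);
        px[i] = a0;     px[i + 1] = a1;
        px[i + 2] = b0; px[i + 3] = b1;
    }
}
```

Both functions produce the same pixels; the unrolled one only changes how much loop overhead and how many independent chains each pass carries. Whether that is faster depends on the CPU and the memory system.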
Posted on 2002-05-05 21:47:31 by bitRAKE
Thanx bitRAKE, but I don't follow this 'unroll' business.

There are 'spare' MMX regs, sure. But I don't see how fetching 16 bytes and doing the process twice in one loop is any faster (relatively) than fetching 8 bytes and doing one process.

They both loop, and since an MMX register is only 8 bytes wide, after this you're saturated for performance savings, right? I mean, another instruction is another instruction: whether I do it twice per loop or once per loop, the number of executions is the same??

Thanx.
:NaN:
Posted on 2002-05-06 10:50:24 by NaN
There are forward dependencies, and my cache line size is 64 bytes. You could skip the unrolling, but then you'd need an inner/outer loop. This is strange territory, as processors differ widely in what they need from the code: manual prefetch, a prefetch instruction, or nothing (auto-prefetch on Athlon XP/P4).
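A rough C sketch of the inner/outer loop idea with a manual prefetch: the outer loop walks the buffer one 64-byte line at a time and reads one dword from the next line before the inner loop processes the current one. The names, the stand-in per-pixel operation, and the assumption that the buffer starts on a 64-byte boundary are all mine; the function returns how many next-line touches it made so the structure can be checked.

```c
#include <stddef.h>
#include <stdint.h>

enum { LINE_BYTES = 64, PX_PER_LINE = LINE_BYTES / 4 };

/* Stand-in for the real per-pixel work: halve every channel. */
static uint32_t halve_px(uint32_t px) {
    return (px >> 1) & 0x7F7F7F7Fu;
}

static size_t process_by_lines(uint32_t *px, size_t n) {
    size_t touches = 0;
    size_t i = 0;
    while (i < n) {                              /* outer loop: one line per pass */
        size_t end = i + PX_PER_LINE;
        if (end > n) end = n;
        if (end < n) {
            /* Manual prefetch: a volatile read from the next line, so the
               memory access is not optimized away. */
            const volatile uint32_t *next = &px[end];
            (void)*next;
            touches++;
        }
        for (; i < end; i++)                     /* inner loop: current line */
            px[i] = halve_px(px[i]);
    }
    return touches;
}
```

On a CPU with hardware prefetch the extra read may buy nothing, which matches the point about processors wanting different things from the code.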
Posted on 2002-05-06 10:59:19 by bitRAKE
Ah yes, that evil *transparent* cache ;)

I forgot about this. Thanx!
Posted on 2002-05-06 11:03:24 by NaN
On faster CPUs there will be no performance gain without prefetch - that is why FSB (front side bus) speed is so important - most software doesn't prefetch. So, the CPU is stuck pulling data from main memory --> L2 cache --> L1 cache. P4 even has an L3 cache, iirc. :eek:
Posted on 2002-05-06 11:16:32 by bitRAKE
Hello..
I tried to use the DoAlpha() proc, but I get a white window :(
I've debugged it, and the API return values are OK. So the problem is probably a bad parameter in one of the API calls.


[b]WM_CREATE[/b]
invoke LoadBitmap,hInstance,addr BitmapName
mov hBitmap,eax
(...)
[b]WM_PAINT[/b]
invoke BeginPaint,hWnd,addr ps
mov hdc,eax
CALL DoAlpha
invoke EndPaint,hWnd,addr ps


Now DoAlpha() code


DoAlpha PROC USES esi edi
;============================================
LOCAL bm :BITMAP
LOCAL bmi :BITMAPINFO
LOCAL lpBits :DWORD
LOCAL SDC :DWORD
LOCAL OldBM :DWORD
LOCAL hBm:DWORD
LOCAL BMDC :DWORD
LOCAL hMemDC:DWORD

invoke CreateCompatibleDC,hdc
mov hMemDC,eax

invoke GetObject, hBitmap, sizeof BITMAP, addr bm

invoke RtlZeroMemory, addr bmi.bmiHeader, sizeof BITMAPINFOHEADER

mov eax, sizeof BITMAPINFOHEADER
mov bmi.bmiHeader.biSize, eax
mov eax, bm.bmWidth
mov bmi.bmiHeader.biWidth, eax
mov eax, bm.bmHeight
neg eax
mov bmi.bmiHeader.biHeight, eax
mov bmi.bmiHeader.biPlanes, 1
mov bmi.bmiHeader.biCompression, BI_RGB
mov bmi.bmiHeader.biBitCount, 32

invoke CreateDIBSection, hMemDC,addr bmi, DIB_RGB_COLORS, addr lpBits, NULL, NULL

mov hBm, eax ;LOCAL handle

invoke SelectObject,hMemDC,eax

invoke BitBlt, hMemDC, 0,0,250,250,hdc, 0,0, SRCCOPY

; Do Alpha.
mask_50_alpha_24 equ 0111111101111111011111111b
xor edx, edx
mov eax, bm.bmWidth
mul bm.bmHeight
mov ecx, eax
mov esi, lpBits
mov edi, 00FFFFFFh
.while( ecx )

mov eax,[esi] ;get src pixel 1
and eax, mask_50_alpha_24

mov edx, edi ;get src pixel 2
and edx, mask_50_alpha_24

add eax,edx
shr eax,1
mov [esi],eax ;place 50% blended result pixel back
add esi, 4
dec ecx
.endw


invoke BitBlt, hdc, 0,0,250,250,hMemDC, 0, 0, SRCCOPY

invoke DeleteDC, hMemDC ;BMDC
invoke DeleteObject, hBm


ret
DoAlpha ENDP
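The blend loop in this proc averages each pixel with 00FFFFFFh using a mask trick: clear the low bit of every channel, add, then shift the whole dword right once. A small C sketch of that step follows; `average_rgb` is a made-up name, and the mask 0x00FEFEFE is my reading of the intent behind `mask_50_alpha_24`, not a value taken from the post.

```c
#include <stdint.h>

/* Average two 0x00RRGGBB pixels channel by channel without unpacking. */
static uint32_t average_rgb(uint32_t a, uint32_t b) {
    const uint32_t mask = 0x00FEFEFEu;   /* clears bit 0 of each channel */
    /* With bit 0 cleared, a carry out of one channel lands in the empty
       bit 0 of the next one, and the shift moves it back into place. */
    return ((a & mask) + (b & mask)) >> 1;
}
```

For example, `average_rgb(0x00000000, 0x00FFFFFF)` gives 0x007F7F7F. Each channel loses its lowest bit before the add, so the result can be one step darker than an exact average.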


I don't know whether I have to use the hdc from BeginPaint directly in DoAlpha or create a compatible one.
:confused::confused:

thanks in advance...

Jean / Coder7345
Posted on 2002-05-27 18:46:16 by coder
It's either your bitmasks on the BitBlts or the way you create the DIB. I think you might have to initialize the DIB or use CreateCompatibleBitmap for it.
Posted on 2002-05-27 23:01:47 by grv575
Well, the problem is solved. :)

I have another question... does lpBits point to (width * height) dwords of RRGGBB pixel colors?
Posted on 2002-05-31 16:35:40 by coder
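On the layout question: for a 32-bit BI_RGB DIB section each pixel is one dword holding 00RRGGBB (blue in the low byte, top byte unused), rows have no padding, and with the negative biHeight used in DoAlpha the rows run top to bottom. A minimal C sketch of indexing such a buffer, with `Rgb` and `get_pixel` as made-up names:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b; } Rgb;

/* Read pixel (x, y) from a top-down 32-bit buffer of `width` pixels per row. */
static Rgb get_pixel(const uint32_t *bits, int width, int x, int y) {
    uint32_t px = bits[(size_t)y * (size_t)width + (size_t)x];
    Rgb out;
    out.r = (uint8_t)(px >> 16);
    out.g = (uint8_t)(px >> 8);
    out.b = (uint8_t)px;
    return out;
}
```

So yes, the pointer can be treated as width * height dwords, as long as the bit count is 32 and the height sign is taken into account when mapping y to a row.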