What does "Only loose 4 bits - instead of 6" mean?
The purpose of the shifting is to allow the signed multiply to use all the bits of the alpha as unsigned data - each resulting word of the alpha registers (registers MM4/5 in the algo above) is transformed into ALPHA*128. This leaves the top bit clear - preserving the sign of (DEST - SOURCE) for the signed addition.
This code is untested, and you can easily unroll it to do four pixels in one loop for more speed. :)
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD
mov eax,buff
mov ecx,len ; number of pixels
shr ecx,1
dec ecx
pxor mm7,mm7
movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB
; punpckldq mm6,mm6 ; ARGBARGB ; no need for this! ;)
psrlw mm4,1 ; ....VW..
punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
; mm4 = {each word alpha*128}
; movq mm3,mm6 ; code fat be gone... ;)
; punpcklbw mm1,mm7 ; no need for this :)
punpcklbw mm6,mm7 ; low bytes! :P
; mm6 = unpacked color
; mm7 = 0
@@:
movq mm0,[eax+ecx*8]
dec ecx
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm6
psubsw mm2,mm6
psllw mm0,1
psllw mm2,1
pmulhw mm0,mm4
pmulhw mm2,mm4
paddsw mm0,mm6
paddsw mm2,mm6
packuswb mm0,mm2
movq [eax+ecx*8],mm0
jnz @B
ret
SolidAlphaBlend ENDP
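To make the point about the shift concrete, here is a minimal scalar sketch in C of a single 16-bit lane of `pmulhw`. It is not part of the original post; the helper names (`as_signed_word`, `high_word`, `pmulhw_lane`, `alpha_times_256`, `alpha_times_128`, `scaled_diff`) are made up for illustration. It only shows that a multiplier word of ALPHA*256 turns negative for alpha >= 128, while ALPHA*128 keeps the top bit clear.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a 16-bit pattern as a signed word, the way PMULHW sees it. */
static int16_t as_signed_word(uint16_t bits) {
    return bits >= 0x8000u ? (int16_t)((int32_t)bits - 65536) : (int16_t)bits;
}

/* Floor division by 65536: "keep the high word" of a signed 32-bit product. */
static int32_t high_word(int32_t product) {
    return product >= 0 ? product >> 16 : -((-product + 65535) >> 16);
}

/* Scalar model of one 16-bit lane of PMULHW. */
static int16_t pmulhw_lane(int16_t a, int16_t b) {
    return (int16_t)high_word((int32_t)a * (int32_t)b);
}

/* Multiplier word holding ALPHA*256: the top bit is set for alpha >= 128. */
static int16_t alpha_times_256(uint8_t alpha) {
    return as_signed_word((uint16_t)(alpha << 8));
}

/* Multiplier word holding ALPHA*128: the top bit is always clear. */
static int16_t alpha_times_128(uint8_t alpha) {
    return as_signed_word((uint16_t)(alpha << 7));
}

/* diff*alpha/256 the way the loop does it: double diff, multiply by ALPHA*128. */
static int16_t scaled_diff(int16_t diff, uint8_t alpha) {
    assert(diff >= -255 && diff <= 255);   /* difference of two bytes */
    return pmulhw_lane((int16_t)(diff * 2), alpha_times_128(alpha));
}
```

With alpha = 200 and a difference of 100, `scaled_diff` gives 78 (100*200/256 rounded down), whereas feeding the ALPHA*256 word straight into the signed multiply flips the sign of the result.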
The explanation is hard to do in a few words - it would be better to read the other thread where the algo was developed (HERE). You'll also see Eóin's work on the algo, and he presents an alternative for non-MMX CPUs at the top.
Thank you for your willingness to help. I think I now understand, and I will walk through what I think is happening; correct me if this is wrong (but I do get a solution, so I think I've got it):
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD
mov eax,buff
mov ecx,len ; number of pixels
shr ecx,1
dec ecx
pxor mm7,mm7
movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB
punpckldq mm6,mm6 ; ARGBARGB
psrlw mm4,1 ; ....VW..
Divide the unpacked data by two to get effectively Alpha*256/2 + Blue/2, or the 128*Alpha you're getting at.
punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
; mm4 = {each word alpha*128}
This *is* 128 times the A char + Blue/2, now copied into all four word locations.
movq mm3,mm6
[b] punpcklbw mm1,mm7[/b] ;??????
punpckhbw mm3,mm7
; mm3 = unpacked color
; mm7 = 0
mm1 is undefined to start with, but the high byte of each word is now zeroed. mm3 is the same, but now holds .A.B.G.R, ok.
@@:
movq mm0,[eax+ecx*8]
dec ecx
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm3
psubsw mm2,mm3
Ok, 8 bytes (two pixels) are read and unpacked into words over two MMX registers. Then the difference between each pair of unpacked values is found, (A-a),(B-b),(G-g),(R-r), for the two pixels vs. the set blend 'rgb' color.
psllw mm0,1
psllw mm2,1
Now multiply each unpacked difference word by 2 for both pixels. This sets up for the upcoming code.
pmulhw mm0,mm4
pmulhw mm2,mm4
Ok, now each unpacked word component for ARGB of each pixel is multiplied by:
[ 2*(R-r) ] * [ 128*Alpha + Blue/2 ] == [ (R-r)*Alpha*256 + (R-r)*Blue ].
This instruction also keeps only the upper word of the dword result, i.e. the product divided by 2^16, which leaves (R-r)*Alpha/256. The (R-r)*Blue term is effectively ignored because it is below 2^8 * 2^8 = 2^16 and is dropped with the low word.
Also, if alpha were 256 the result would be just (R-r). If alpha is 0, then it's 0. For alpha = 128 it's (R-r)/2. And this is applied evenly through all components of both pixels. As I see it now, this is how a percentage is found, with a resolution of 1/256 per alpha step over 0->255.
paddsw mm0,mm3
paddsw mm2,mm3
Now that fraction of (R-r) is added back to the 'r' blend color we started with. Effectively this adds a percentage of the difference of the two colors, as I stated in the earlier equation: D + Alpha%(S-D) for alpha blend.
packuswb mm0,mm2
movq [eax+ecx*8],mm0
jnz @B
ret
SolidAlphaBlend ENDP
Repack the two pixels A'B'G'R'A'B'G'R' and save them. Then loop on to the next two pixels.
Thanx, I think I got it.... Umm, I don't think the line punpcklbw mm1,mm7 is used or needed tho?
Thanx again bitRake and Eoin!
:alright:
NaN
Posted on 2002-05-05 18:03:20 by NaN
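The walk-through above can be checked against a scalar model. The following is a minimal sketch in C, not code from the thread: `high_word`, `saturate_u8`, `blend_channel` and `blend_pixel` are hypothetical names, and the color is assumed to be laid out as 0xAARRGGBB as in the `....ARGB` comments of the listing (the walk-through names the bytes .A.B.G.R; only the position of alpha in the top byte matters here). Each comment names the MMX instruction that the step stands in for.

```c
#include <assert.h>
#include <stdint.h>

/* Floor division by 65536: the "upper word" that pmulhw keeps. */
static int32_t high_word(int32_t product) {
    return product >= 0 ? product >> 16 : -((-product + 65535) >> 16);
}

/* Clamp to 0..255, as packuswb does for each signed word. */
static uint8_t saturate_u8(int32_t v) {
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* One channel. mult is the word replicated in mm4: (alpha*256 + next byte) >> 1. */
static uint8_t blend_channel(uint8_t pixel, uint8_t color, uint16_t mult) {
    assert(mult < 0x8000u);                                /* top bit clear after psrlw */
    int32_t diff = (int32_t)pixel - (int32_t)color;        /* psubsw */
    int32_t scaled = high_word(diff * 2 * (int32_t)mult);  /* psllw 1, then pmulhw */
    return saturate_u8(scaled + (int32_t)color);           /* paddsw, then packuswb */
}

/* All four bytes of one pixel, blended toward `color` (assumed 0xAARRGGBB). */
static uint32_t blend_pixel(uint32_t pixel, uint32_t color) {
    uint16_t mult = (uint16_t)((color >> 16) >> 1);        /* psrlw 1 on the high word */
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint8_t p = (uint8_t)(pixel >> shift);
        uint8_t c = (uint8_t)(color >> shift);
        out |= (uint32_t)blend_channel(p, c, mult) << shift;
    }
    return out;
}
```

For example, with alpha = 0x80 a channel value of 200 blended against a color value of 100 comes out as 150, halfway between the two. Negative differences round toward minus infinity because the high word of a negative product is a floor, so results can be one lower than a symmetric formula would give.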
Sorry, I do that a lot because I don't like making a thousand posts. Also, note that after reading your post I made three corrections to the code. You seem to have a good grasp of it - way to go! Looking at the registers with some test data in OllyDbg is the best way, imho.
Never used Ollydbg. Will have to check it out.
Errors that crashed my machine *again* gave me reason to get OllyDbg sooner than I thought. :) * I like the user interface, but wish you could close a file without exiting, so I can recompile :rolleyes: .
Anyways, with its help I saw an error we have both overlooked. Well, actually two, from the same source: the way the memory is being read and saved to. You're decrementing backwards in memory (which is ok), but you start 8 bytes beyond the bitmap boundary, and finish 8 bytes too soon when you dec/jnz in a loop. As well, the more serious problem was that the source bytes are not 1:1 with the destination bytes, since ECX was being decremented before the MMX algo and its save point (this is what crashed the machine ~ hard ;) )
So here is my fix to your source, and it works well now.
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD
mov eax,buff
mov ecx,len ; number of pixels
shr ecx,1
dec ecx
pxor mm7,mm7
movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB
psrlw mm4,1 ; ....VW..
punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
punpcklbw mm6,mm7 ; low bytes! :P
xor edx, edx
.while (edx < ecx )
movq mm0,[eax+edx*8]
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm6
psubsw mm2,mm6
psllw mm0,1
psllw mm2,1
pmulhw mm0,mm4
pmulhw mm2,mm4
paddsw mm0,mm6
paddsw mm2,mm6
packuswb mm0,mm2
movq [eax+edx*8],mm0
inc edx
.endw
ret
SolidAlphaBlend ENDP
Thanx again bitRAKE for all your help!
:alright:
NaN
Here is the fixed loop.
SolidAlphaBlend PROC buff:DWORD, len:DWORD, color:DWORD
mov eax,buff
mov ecx,len ; number of bytes
shr ecx,3 ; bytes / ((4 bytes/pixel) * (2 pixels/loop))
pxor mm7,mm7
movd mm6,color ; ....ARGB
movq mm4,mm6 ; ....ARGB
punpcklbw mm6,mm7
psrlw mm4,1 ; ....VW..
punpcklwd mm4,mm4 ; VWVW....
punpckhdq mm4,mm4 ; VWVWVWVW
jmp _1
ALIGN 8
_0: movq mm0,[eax+ecx*8]
movq mm2,mm0 ; FEDCBA98
punpcklbw mm0,mm7 ; .E.C.A.8
punpckhbw mm2,mm7 ; .F.D.B.9
psubsw mm0,mm6
psubsw mm2,mm6
psllw mm0,1
psllw mm2,1
pmulhw mm0,mm4
pmulhw mm2,mm4
paddsw mm0,mm6
paddsw mm2,mm6
packuswb mm0,mm2
movq [eax+ecx*8],mm0
_1: dec ecx
jns _0
ret
SolidAlphaBlend ENDP
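The indexing differences between the three loop shapes posted so far are easy to lose in the MMX noise, so here is a minimal sketch in C that traces only the index register. It is an illustration under assumptions, not code from the thread: `LoopTrace`, `record` and the three `trace_*` functions are hypothetical names, `groups` stands for the number of 8-byte groups (two pixels each), and the first trace assumes `jnz` sees the flags left by `dec ecx`, since the MMX instructions in between do not modify EFLAGS.

```c
#include <assert.h>

typedef struct {
    int iterations;    /* loop bodies executed */
    int mismatched;    /* iterations that stored to a different group than they loaded */
    int loaded_first;  /* 1 if group 0 was ever loaded */
    int stored_last;   /* 1 if group (groups - 1) was ever stored */
} LoopTrace;

static void record(LoopTrace *t, int load_index, int store_index, int groups) {
    t->iterations++;
    if (load_index != store_index) t->mismatched++;
    if (load_index == 0) t->loaded_first = 1;
    if (store_index == groups - 1) t->stored_last = 1;
}

/* First listing: ecx = groups-1; load [ecx]; dec ecx; ...; store [ecx]; jnz. */
static LoopTrace trace_dec_before_store(int groups) {
    assert(groups >= 2);               /* with 1 group ecx wraps below zero and keeps going */
    LoopTrace t = {0, 0, 0, 0};
    int ecx = groups - 1;
    do {
        int load_index = ecx;
        ecx--;                         /* the dec sits between the load and the store */
        record(&t, load_index, ecx, groups);
    } while (ecx != 0);
    return t;
}

/* .while version: ecx is still decremented once, then edx counts up while edx < ecx. */
static LoopTrace trace_forward_while(int groups) {
    assert(groups >= 1);
    LoopTrace t = {0, 0, 0, 0};
    int ecx = groups - 1;
    for (int edx = 0; edx < ecx; edx++) record(&t, edx, edx, groups);
    return t;
}

/* dec/jns version: ecx = groups; jump to the dec first; loop while ecx >= 0. */
static LoopTrace trace_dec_jns(int groups) {
    LoopTrace t = {0, 0, 0, 0};
    int ecx = groups;
    while (--ecx >= 0) record(&t, ecx, ecx, groups);
    return t;
}
```

For four groups, the first shape never loads group 0 and never stores the last group, and every store lands one group below its load. The `.while` shape, read exactly as posted with `dec ecx` still in place, pairs loads and stores correctly but appears to stop one group short. The dec/jns shape touches all four groups.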
Unroll and interleave two loops (using two more MMX registers), then unroll again for 8 pixels in one loop, and add a prefetch instruction, then you're really cooking...
Thanx bitRAKE, but I don't follow this 'unroll' business.
There are 'spare' MMX regs, sure. But I don't see how fetching 16 bytes and doing the process two times over in one loop is any faster (relatively) than fetching 8 bytes and doing one process.
They both loop, and since an MMX reg is only 8 bytes wide, after this you're saturated for performance savings, right? I mean, another instruction is another instruction; whether I do it two times in a loop or one time in a loop, the number of executions is the same??
Thanx.
:NaN:
There are forward dependencies, and my cache line size is 64 bytes. You could choose not to unroll, but then you'd need an inner/outer loop. This is strange territory, as processors have become very different in what they need in the code: manual prefetch, a prefetch instruction, or nothing (auto-prefetch on Athlon XP/P4).
Ah yes, that evil *transparent* cache ;)
I forgot about this. Thanx!
On faster CPUs there will be no performance gain without prefetch - that is why FSB (front side bus) speed is so important - most software doesn't prefetch. So, the CPU is stuck pulling data from main memory --> L2 cache --> L1 cache. P4 even has an L3 cache, iirc. :eek:
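For anyone else puzzled by the 'unroll' business, here is a minimal sketch in C of what the loop shape looks like once unrolled by four. It is only an illustration: `blend`, `blend_rolled` and `blend_unrolled4` are hypothetical names, the per-pixel work is a plain scalar stand-in for the MMX blend (alpha is assumed to be 0..256), and a C compiler may well unroll on its own. The total amount of pixel work is the same; what changes is that the four computations in one iteration do not depend on each other, and the loop counter and branch are paid once per four pixels. Whether that is actually faster depends on the CPU and on memory, as discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the per-pixel work: weighted mix of the three low bytes. */
static uint32_t blend(uint32_t pixel, uint32_t color, uint32_t alpha) {
    uint32_t out = 0;
    for (int shift = 0; shift < 24; shift += 8) {
        uint32_t p = (pixel >> shift) & 0xFFu;
        uint32_t c = (color >> shift) & 0xFFu;
        out |= ((p * alpha + c * (256u - alpha)) >> 8) << shift;
    }
    return out;
}

/* One pixel per iteration: one compare and branch per pixel. */
static void blend_rolled(uint32_t *buf, size_t count, uint32_t color, uint32_t alpha) {
    for (size_t i = 0; i < count; i++) buf[i] = blend(buf[i], color, alpha);
}

/* Four pixels per iteration: the four computations are independent of each other. */
static void blend_unrolled4(uint32_t *buf, size_t count, uint32_t color, uint32_t alpha) {
    assert(alpha <= 256u);
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        uint32_t a = buf[i], b = buf[i + 1], c = buf[i + 2], d = buf[i + 3];
        a = blend(a, color, alpha);
        b = blend(b, color, alpha);
        c = blend(c, color, alpha);
        d = blend(d, color, alpha);
        buf[i] = a; buf[i + 1] = b; buf[i + 2] = c; buf[i + 3] = d;
    }
    for (; i < count; i++) buf[i] = blend(buf[i], color, alpha);  /* leftover pixels */
}
```

Both functions produce the same buffer; the unrolled one just needs the extra tail loop for counts that are not a multiple of four.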
hello..
I'm trying to use the DoAlpha() proc, but I get a white window :(
I've debugged it, and the API return values are ok. So probably the code passes a bad API param somewhere.
[b]WM_CREATE[/b]
invoke LoadBitmap,hInstance,addr BitmapName
mov hBitmap,eax
(...)
[b]WM_PAINT[/b]
invoke BeginPaint,hWnd,addr ps
mov hdc,eax
CALL DoAlpha
invoke EndPaint,hWnd,addr ps
Now the DoAlpha() code:
DoAlpha PROC USES esi edi
;============================================
LOCAL bm :BITMAP
LOCAL bmi :BITMAPINFO
LOCAL lpBits :DWORD
LOCAL SDC :DWORD
LOCAL OldBM :DWORD
LOCAL hBm:DWORD
LOCAL BMDC :DWORD
LOCAL hMemDC:DWORD
invoke CreateCompatibleDC,hdc
mov hMemDC,eax
invoke GetObject, hBitmap, sizeof BITMAP, addr bm
invoke RtlZeroMemory, addr bmi.bmiHeader, sizeof BITMAPINFOHEADER
mov eax, sizeof BITMAPINFOHEADER
mov bmi.bmiHeader.biSize, eax
mov eax, bm.bmWidth
mov bmi.bmiHeader.biWidth, eax
mov eax, bm.bmHeight
neg eax
mov bmi.bmiHeader.biHeight, eax
mov bmi.bmiHeader.biPlanes, 1
mov bmi.bmiHeader.biCompression, BI_RGB
mov bmi.bmiHeader.biBitCount, 32
invoke CreateDIBSection, hMemDC,addr bmi, DIB_RGB_COLORS, addr lpBits, NULL, NULL
mov hBm, eax ;LOCAL handle
invoke SelectObject,hMemDC,eax
invoke BitBlt, hMemDC, 0,0,250,250,hdc, 0,0, SRCCOPY
; Do Alpha.
mask_50_alpha_24 equ 0111111101111111011111111b
xor edx, edx
mov eax, bm.bmWidth
mul bm.bmHeight
mov ecx, eax
mov esi, lpBits
mov edi, 00FFFFFFh
.while( ecx )
mov eax,[esi] ;get src pixel 1
and eax, mask_50_alpha_24
mov edx, edi ;get src pixel 2
and edx, mask_50_alpha_24
add eax,edx
shr eax,1
mov [esi],eax ;place 50% blended result pixel back
add esi, 4
dec ecx
.endw
xor edx, edx
mov eax, bm.bmWidth
mul bm.bmHeight
mov ecx, eax
mov esi, lpBits
mov edi, 00FFFFFFh
.while( ecx )
mov eax,[esi] ;get src pixel 1
and eax, mask_50_alpha_24
mov edx, edi ;get src pixel 2
and edx, mask_50_alpha_24
add eax,edx
shr eax,1
mov [esi],eax ;place 50% blended result pixel back
add esi, 4
dec ecx
.endw
invoke BitBlt, hdc, 0,0,250,250,hMemDC, 0, 0, SRCCOPY
invoke DeleteDC, hMemDC ;BMDC
invoke DeleteObject, hBm
ret
DoAlpha ENDP
I don't know if I have to use the hdc from BeginPaint directly in DoAlpha or if I have to create a compatible one.
:confused::confused:
thanks in advance...
Jean / Coder7345
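The masked 50% blend in the loop above can be written as a small scalar sketch in C. This is an illustration rather than the poster's code: `average_rgb` and `fade_to_color` are hypothetical names, and the mask 0x00FEFEFE is an assumption about the intent (clear the low bit of each byte so that the shared shift cannot carry a bit from one channel into its neighbour).

```c
#include <assert.h>
#include <stdint.h>

/* Average two 0x00RRGGBB values channel by channel with one add and one shift. */
static uint32_t average_rgb(uint32_t a, uint32_t b) {
    const uint32_t mask = 0x00FEFEFEu;      /* drop the low bit of every channel */
    return ((a & mask) + (b & mask)) >> 1;  /* and, and, add, shr: same order as the loop */
}

/* Blend every pixel of a buffer 50% toward one fixed color. */
static void fade_to_color(uint32_t *pixels, uint32_t count, uint32_t color) {
    assert(pixels != 0 || count == 0);
    for (uint32_t i = 0; i < count; i++) {
        pixels[i] = average_rgb(pixels[i], color);
    }
}
```

Averaging black with 0x00FFFFFF gives 0x007F7F7F, so one pass lifts a dark image halfway toward white. Each channel loses its lowest bit before the add, which is the price of doing all three channels in one register.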
It's either your bitmasks on the BitBlts or the way you create the DIB bitmap. I think you might have to initialize the DIB or call CreateCompatibleBitmap for it.
Well, the problem is solved. :)
I have another question... is lpBits pointing to (width * height) dwords of RRGGBB pixel colors?