I was rewriting an old pixel blending routine in MMX to try to speed it up. Here is the main calculating part of the original code. I realise it's hard to follow without the full thing, but this is the part I changed to MMX.

push ebx
mov ebx,dword ptr

mov eax,ebx
mov edx,
and edx,0ffh
and eax,0ffh
sub edx,eax
imul edx,ecx
sar edx,8
add eax,edx
mov byte ptr tddb,al

mov eax,ebx
mov edx,
shr eax,8
shr edx,8
and edx,0ffh
and eax,0ffh
sub edx,eax
imul edx,ecx
sar edx,8
add eax,edx
mov byte ptr tddb[1],al

mov eax,ebx
mov edx,
shr eax,16
shr edx,16
and edx,0ffh
and eax,0ffh
sub edx,eax
imul edx,ecx
sar edx,8
add eax,edx
mov byte ptr tddb[2],al

mov eax,tddb
pop ebx
mov dword ptr ,eax

You can easily see from this that I did the blending separately for the red, green & blue colours. I thought MMX would speed things up, seeing as it could do all this at the same time.
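
For anyone following along without the full routine, here is a rough scalar sketch in C of what the listing above appears to compute per channel: x + ((y - x) * alpha >> 8). The function names are made up for illustration, alpha is assumed to be in 0..255 (the value held in ecx), and the top byte of the result is simply left as zero here, which may not match what the original does with tddb.

```c
#include <stdint.h>

/* Floor division by 256, i.e. what `sar reg,8` does, written without
   relying on how C shifts negative values. */
static int sar8(int v)
{
    return v >= 0 ? v >> 8 : -((-v + 255) >> 8);
}

/* One colour channel: x + ((y - x) * alpha >> 8), alpha in 0..255. */
static uint8_t blend_channel(uint8_t x, uint8_t y, int alpha)
{
    return (uint8_t)(x + sar8((y - x) * alpha));
}

/* Blend the three low bytes of two packed pixels, one channel at a time.
   Assumption: the top byte of the result is left as zero in this sketch. */
static uint32_t blend_pixel(uint32_t x, uint32_t y, int alpha)
{
    uint32_t out = 0;
    for (int shift = 0; shift <= 16; shift += 8) {
        uint8_t cx = (uint8_t)(x >> shift);
        uint8_t cy = (uint8_t)(y >> shift);
        out |= (uint32_t)blend_channel(cx, cy, alpha) << shift;
    }
    return out;
}
```

Note that with a shift by 8 an alpha of 255 does not quite reach the other pixel: blend_channel(0, 255, 255) gives 254.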

Here's the code I came up with.

mov al,[3]
shr al,1
mov ah,al

mov [0],ax
mov [2],ax

movd mm(0),
movd mm(1),
movd mm(2),

punpcklbw mm(0),zer
punpcklbw mm(1),zer
punpcklbw mm(2),zer

psubw mm(0),mm(1)
pmullw mm(0),mm(2)
psraw mm(0),7
paddw mm(0),mm(1)

packuswb mm(0),mm(0)
movd ,mm(0)

The MMX code is much smaller; however, timings reveal it only runs 2 ticks faster: 35 as opposed to 37 for the first method.

I know every bit helps, but the MMX algo does not actually do exactly what the initial one did: it only allows 127 steps of blending, while the former allowed the full 255.

Perhaps there's a better way of doing this in MMX; I'm still very new to MMX coding. One thing I dislike is the initial setup required for the values before the calculation, which itself is just four instructions.

Anyone have any opinions? I suppose, given that the reduction in blending is minor and that 2 ticks could well build up over large images, the second algo looks better.
Posted on 2001-12-28 13:12:16 by Eóin
Use the registers, there are eight of them. :grin:
Posted on 2001-12-28 14:14:08 by bitRAKE
I don't quite get you. I don't really know how to fit more registers into the algo without trying to process more than one pixel at a time, and that would introduce additional complications. Here's what I'm trying to do.

esi points to a pixel: RRGGBBAA
    AA is the alpha value of the pixel.

ebx similarly points to: RRGGBB??
    The alpha value of this pixel is never used.

edx points to four bytes set to the alpha value of esi. The code I used to create this is:
    mov al,[3]
    shr al,1
    mov ah,al
    The shr was necessary to avoid overflow in the MMX registers; this causes the MMX algo to be less precise.

Basically the following calculation is then done to each colour.
ebx = (esi-ebx)(edx/128)+ebx

The movd and punpcklbw combination converts the bytes into words, while packuswb converts them back to bytes.
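
To show why the alpha has to be halved, here is a small C sketch that models one 16-bit lane of the psubw / pmullw / psraw / paddw / packuswb sequence. The helper names are invented for this example and it is only a scalar model of the lane arithmetic, not the MMX code itself.

```c
#include <stdint.h>

/* Low 16 bits of a signed product, reinterpreted as a signed word:
   a scalar model of one pmullw lane. */
static int pmullw_lane(int x, int y)
{
    uint16_t low = (uint16_t)((uint32_t)(x * y) & 0xFFFFu);
    return low >= 0x8000u ? (int)low - 0x10000 : (int)low;
}

/* Arithmetic shift right (floor division by 2^n): one psraw lane. */
static int psraw_lane(int v, int n)
{
    int div = 1 << n;
    return v >= 0 ? v / div : -((-v + div - 1) / div);
}

/* Unsigned saturation to a byte: one packuswb lane. */
static uint8_t packuswb_lane(int v)
{
    return (uint8_t)(v < 0 ? 0 : v > 255 ? 255 : v);
}

/* (s - d) * alpha / 2^shift + d for one channel. */
static uint8_t blend_lane(int s, int d, int alpha, int shift)
{
    int diff = s - d;                      /* psubw  */
    int prod = pmullw_lane(diff, alpha);   /* pmullw */
    int scaled = psraw_lane(prod, shift);  /* psraw  */
    return packuswb_lane(d + scaled);      /* paddw + packuswb */
}
```

With a full 8-bit alpha, 255 * 255 = 65025 does not fit in a signed word, so blend_lane(255, 0, 255, 8) wraps and comes out as 0. With the halved alpha, 255 * 127 = 32385 still fits, and blend_lane(255, 0, 127, 7) gives 253.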

Beyond this I don't know how to improve the code. Any suggestions? :confused:
Posted on 2001-12-28 20:10:59 by Eóin
Well, I do mean processing multiple pixels at a time, and handy tricks like pxor mm0,mm0 to get zero. Look at all the dependencies in the code! Mixing in another pixel would eliminate almost all of those. Don't store the alpha as bytes only to load it into the MMX reg! It was already in a register, so why store it to memory and load it back? These are basic things you should be thinking about.

Draw diagrams of where the data flows and how it is transformed. It will require some work, but the results will be something you can be very proud of.
Posted on 2001-12-28 20:33:46 by bitRAKE
Right so, I'll give that a shot. It should be easy to add one more pixel in anyway.

BTW, do you have any more handy tricks :)
Posted on 2001-12-28 20:39:48 by Eóin
Of course, here are some starters:

pxor mm0,mm0 ;0000000000000000
pcmpeqb mm0,mm0 ;FFFFFFFFFFFFFFFF

pxor mm0, mm0
pcmpeqb mm1, mm1
psubb mm0,mm1 ;0101010101010101

pxor mm0, mm0
pcmpeqb mm1, mm1
psubw mm0,mm1 ;0001000100010001

pxor mm0, mm0
pcmpeqb mm1, mm1
psubd mm0,mm1 ;0000000100000001

pcmpeqb mm1,mm1
psrlw mm1,16-n ;2^n-1 in each word

pcmpeqb mm1,mm1
psrld mm1, 32-n ;2^n-1 in each dword

pcmpeqb mm1,mm1
psllw mm1, n ;-2^n in every word

pcmpeqb mm1,mm1
pslld mm1, n ;-2^n in every dword
...and as always RTFM. :grin:
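
A quick way to sanity-check the shift-based constants is to model a single 16-bit lane in C. This is only a sketch with made-up function names: all-ones shifted right by 16-n leaves 2^n - 1, and all-ones shifted left by n leaves -2^n when read as a signed word.

```c
#include <stdint.h>

/* pcmpeqb mm1,mm1 then psrlw mm1,16-n: one 16-bit lane, n in 1..15. */
static uint16_t ones_shifted_right(int n)
{
    return (uint16_t)(0xFFFFu >> (16 - n));
}

/* pcmpeqb mm1,mm1 then psllw mm1,n: one 16-bit lane, n in 1..15. */
static uint16_t ones_shifted_left(int n)
{
    return (uint16_t)(0xFFFFu << n);
}

/* pxor mm0,mm0; pcmpeqb mm1,mm1; psubw mm0,mm1: 0 - (-1) per lane. */
static uint16_t zero_minus_ones(void)
{
    return (uint16_t)(0u - 0xFFFFu);
}
```

For example, ones_shifted_right(8) is 0x00FF and ones_shifted_left(8) is 0xFF00, which is -256 as a signed word.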
Posted on 2001-12-28 20:46:43 by bitRAKE
Posted on 2001-12-28 21:51:37 by bitRAKE
Thanks for those links; from them I figured out (or rather read) how to get the full 255 levels of blending. Also, thanks for suggesting that I process two pixels together. It now calculates two pixels in about 43 ticks, as opposed to the 74 the first method I used would have taken.

I have one final question, please. As I said, the fourth byte of the pixel data is the alpha value. Ultimately I need to create a QWORD as follows: 0A0A0A0A. The other method is to construct a dword of four bytes all set to the alpha value; MMX instructions can then convert that to the qword format required.

Can anyone suggest an efficient method for doing this? I know bitRAKE said not to store them only to load them later, but I can't figure out a cleaner solution.

Also, I wonder, seeing as I'll have to do it for both pixels separately, is there a neat method to do both together?

Any and all help is appreciated. I hope I don't sound like I'm asking to have the job done for me; I'd just like to get this as fast as possible. Thanks all. :)
Posted on 2001-12-29 13:20:35 by Eóin
movd mm2,ALPHA    ; Copy ALPHA into mm2

punpcklwd mm2,mm2 ; Unpack mm2 - 0000 0000 00aa 00aa
punpckldq mm2,mm2 ; Unpack mm2 - 00aa 00aa 00aa 00aa
movq mm0,[esi]    ; 2def1abc

; {put something here - like: add esi,8}
psrlw mm0,8 ; 020e010b
; {put something here}
movq mm1,mm0 ; 020e010b
punpcklwd mm0,mm0 ; 01010b0b
punpckhwd mm1,mm1 ; 02020e0e
punpckhdq mm0,mm0 ; 01010101
punpckhdq mm1,mm1 ; 02020202
You should have only two qword reads and one qword write to memory. Maybe, you can get it down to ~34 ticks for two pixels? :grin:
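
The two unpack sequences above can be followed in C by treating the MMX register as a 64-bit integer. This is a sketch only: the helper names are invented, each helper models the instruction with the same register as both operands, and the alpha is assumed to sit in the top byte of each pixel as in the 2def1abc diagram.

```c
#include <stdint.h>

/* Word i (0 = lowest) of a 64-bit value. */
static uint64_t word(uint64_t q, int i) { return (q >> (16 * i)) & 0xFFFFu; }

/* psrlw q,8: shift each 16-bit lane right by 8. */
static uint64_t psrlw8(uint64_t q)
{
    return (q >> 8) & 0x00FF00FF00FF00FFull;
}

/* punpcklwd q,q: the two low words, each duplicated. */
static uint64_t punpcklwd_self(uint64_t q)
{
    uint64_t w0 = word(q, 0), w1 = word(q, 1);
    return w0 | (w0 << 16) | (w1 << 32) | (w1 << 48);
}

/* punpckhwd q,q: the two high words, each duplicated. */
static uint64_t punpckhwd_self(uint64_t q)
{
    uint64_t w2 = word(q, 2), w3 = word(q, 3);
    return w2 | (w2 << 16) | (w3 << 32) | (w3 << 48);
}

/* punpckldq q,q: low dword duplicated. */
static uint64_t punpckldq_self(uint64_t q)
{
    uint64_t lo = q & 0xFFFFFFFFull;
    return lo | (lo << 32);
}

/* punpckhdq q,q: high dword duplicated. */
static uint64_t punpckhdq_self(uint64_t q)
{
    uint64_t hi = q >> 32;
    return hi | (hi << 32);
}

/* First listing: movd, punpcklwd, punpckldq. */
static uint64_t broadcast_alpha(uint32_t alpha)
{
    return punpckldq_self(punpcklwd_self(alpha));
}

/* Second listing: both alphas from one qword holding two pixels. */
static void split_alphas(uint64_t two_pixels, uint64_t *a1, uint64_t *a2)
{
    uint64_t t = psrlw8(two_pixels);
    *a1 = punpckhdq_self(punpcklwd_self(t));  /* alpha of the low pixel  */
    *a2 = punpckhdq_self(punpckhwd_self(t));  /* alpha of the high pixel */
}
```

For example, broadcast_alpha(0x7F) yields 0x007F007F007F007F, and split_alphas on 0x8011223340445566 yields 0x0040 in every word of the first result and 0x0080 in every word of the second.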
Posted on 2001-12-29 17:35:54 by bitRAKE
Right, I got the first method working, now I see you have a second method so I'll try that out as well and compare them. Thanks for all this help.

Oh yeah, as for "Maybe, you can get it down to ~34 ticks for two pixels?": why? Is that what you managed, you evil genius? :grin: Currently I'm at about 41.

EDIT -> Ok, second method is 1 tick faster. Where do you come up with this stuff?
Posted on 2001-12-29 19:13:32 by Eóin
Currently, I'm at ~32 on Athlon

...~28! Only two non-MMX instructions in the loop:
dec ecx
jns @Loop
...now I need to test it on some images...

Edit: ~13!! I could improve the output by adding one to the alpha, but it'd cost a tick. 11 or 12 might be the best possible with only two pixels at once - let me think about it some more...maybe getting rid of the shifts? ...maybe packing earlier and doing a paddsb? Let me know when you want me to post the code. I'm curious how it will time on your CPU.
Posted on 2001-12-29 23:09:46 by bitRAKE
Post the code whenever you think it's ready. I'm pretty sure now I won't be able to compete with it, but I can still learn from it.

PS, that's a very big jump from 28 to 14. Did you try a completely different approach or just optimise the method you were using?
Posted on 2001-12-30 07:21:03 by Eóin
Okay, I outdid myself - very proud of this. :)

(Of course, this assumes all data is in the cache - which it isn't in real life.)

Actual use clocks it around ~??, but that is because you're waiting on memory to load into the cache. I'll do more tests myself with prefetching the data to improve actual usage speed.

It was some real work, but it feels great when it comes together.
Please look at the time program to see how I measured.
What does it do on your machine?
Posted on 2001-12-31 00:28:45 by bitRAKE
Athlon-700 thingie... alpha-time:

Minimum Time:
10912 ticks for 1000 quads.

The xfade picture test went very smoothly, but also slowly (255 full levels with some waiting between each? :) ).
Posted on 2001-12-31 04:54:19 by f0dder
Athlon 800,

Minimum Time:
10912 ticks for 1000 quads.

Posted on 2001-12-31 07:47:57 by bazik
:tongue: Celeron 333,

Minimum Time:
11440 ticks for 1000 quads.

The other program just starts and displays only the underlying desktop content at that position.
Posted on 2001-12-31 08:13:14 by sys64738
AMD Athlon XP 1.53GHz
-- -------------------------
Minimum Time:
10912 ticks for 1000 quads.

Do AMDs actually support MMX instructions or do they just emulate them?
Posted on 2001-12-31 11:11:27 by Mecurius
I don't process WM_PAINT messages in the xFade program. Hey, it's just a little test and I was being lazy. If there wasn't a pause, you wouldn't see much.

I guess the timing thingie works - At least, on Athlons. :)
Emulated or not, you can't tell in the code - besides it running faster. ;) They actually support MMX.
Posted on 2001-12-31 12:21:14 by bitRAKE
11424 ticks on a PIII 700, beautiful work.
And I wasn't as far behind as I thought; my method clocked in at 17568 on your timer program.

But, eemmh, I hate to say this, but I don't think it works properly :( .

Now I can't be sure, so please prove me wrong, but it seems to suffer from the exact problem I was having when I first changed the algo to MMX. You can't represent 255 levels of alpha if you're going to use the equation d+(s-d)(a/255). The problem is that s-d can range from -255 to 255, beyond the range of a byte. Because you use unsigned addition and subtraction, your code will blend perfectly when s-d > 0, which happens whenever the destination is darker than the source.

In the fade example it just so happens that this is the case for most of the image. However, if you look closely at the red circle closest to the top-left, you'll see some of the tree showing through where it was brighter than the circle.
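
The effect is easy to reproduce with a scalar model in C. The code being discussed is not shown in the thread, so this is only an assumed sketch of "unsigned saturated byte math" (in the style of psubusb / paddusb) with invented function names, not a reconstruction of the actual routine.

```c
#include <stdint.h>

/* psubusb-style lane: unsigned subtract, clamped at 0. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* paddusb-style lane: unsigned add, clamped at 255. */
static uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    int sum = a + b;
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* d + (s - d) * a / 256 with the difference taken in saturated bytes. */
static uint8_t blend_saturated(uint8_t s, uint8_t d, uint8_t a)
{
    uint8_t diff = sub_sat_u8(s, d);               /* negative differences become 0 */
    uint8_t scaled = (uint8_t)((diff * a) >> 8);
    return add_sat_u8(d, scaled);
}
```

When the source is brighter the blend is fine: blend_saturated(200, 100, 128) gives 150. When the destination is brighter the difference clamps to zero and the destination comes through untouched: blend_saturated(100, 200, 128) gives 200 where a correct blend would give about 150.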

There are two solutions to this problem. The first is to perform the addition and subtraction on the word form of the pixels; however, this means two additional instructions.

There's a second problem here that will also occur if (s-d)(a) exceeds the range of a signed word. This tends to occur for a > 128 and s-d > 128. The solution here is to limit a to the range 0 - 127, and you also have to use psraw mm(i),7 for the division.

The second solution is the one I opted for in my code. I found it on those links you gave me; it's just your standard method of transforming points, and the advantage is that it doesn't involve any negatives, so you get the full 255 levels.

You use the equation d((255-a)/255) + s(a/255). Here's the code I came up with:

movq mm(0),[eax][ecx*8]
movq mm(1),[edx][ecx*8]

@@: movq mm(2),mm(0)
movq mm(4),mm(0)

psrlw mm(2),8
movq mm(5),mm(1)

movq mm(6),mm(2)
dec ecx

punpcklwd mm(2),mm(2)
punpckhwd mm(6),mm(6)
punpckhdq mm(2),mm(2)
punpckhdq mm(6),mm(6)

pxor mm(7),mm(7)
movq mm(3),max

punpckhbw mm(4),mm(7)
punpckhbw mm(5),mm(7)
psubw mm(3),mm(2)
punpcklbw mm(0),mm(7)
punpcklbw mm(1),mm(7)

movq mm(7),max
pmullw mm(0),mm(2)
psubw mm(7),mm(6)

pmullw mm(4),mm(6)
pmullw mm(5),mm(7)
pmullw mm(1),mm(3)

paddw mm(4),mm(5)
paddw mm(0),mm(1)
psrlw mm(4),8
psrlw mm(0),8

packuswb mm(0),mm(4)
movq [edx][ecx*8][8],mm(0)

movq mm(0),[eax][ecx*8]
movq mm(1),[edx][ecx*8]
jns @B
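
The per-channel arithmetic of the loop above can be sketched in C as follows. This assumes that max holds 255 in every word (its definition is not shown) and that [eax] is the source pixel carrying the alpha; the function name is made up for the example.

```c
#include <stdint.h>

/* (s*a + d*(255 - a)) >> 8 for one channel, all in unsigned 16-bit.
   Assumption: `max` in the listing is 255 per word. */
static uint8_t blend_two_products(uint8_t s, uint8_t d, uint8_t a)
{
    uint16_t ps = (uint16_t)(s * a);           /* pmullw: source * alpha       */
    uint16_t pd = (uint16_t)(d * (255 - a));   /* pmullw: dest * (max - alpha) */
    uint16_t sum = (uint16_t)(ps + pd);        /* paddw                        */
    return (uint8_t)(sum >> 8);                /* psrlw 8, then packuswb       */
}
```

Both products are non-negative and their sum is at most 255 * 255 = 65025, so it always fits in an unsigned word. Under these assumptions the shift by 8 divides by 256 rather than 255, so a channel value of 255 comes out as 254 even at alpha 0 or 255.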

I really do hate to be the one to tell you this after all the effort you put into it and in helping me, and I do hope I'm wrong.
Posted on 2001-12-31 13:05:57 by Eóin
You're not wrong at all. :)
Nice work.

Another fix would be to make the alpha a signed byte, and use the highword of the mul. This eliminates a shift and the multiplies. You're absolutely correct though - you can't do saturated byte math at all without losing data.
Posted on 2001-12-31 13:31:04 by bitRAKE