Thanks, and I hope I haven't spoiled your new year.

I've thought of something else which would make the algo more useful: if the alpha values of each pixel could also be blended, then you'd have two forms of control, whole image and per pixel. Of course you wouldn't always want this, so it should be written as a separate algo.

I've a pretty good idea of how to implement this so I'll give it a try. It's just a pity there are no registers left. Why do you always run out so quickly?
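As a rough illustration of the two forms of control, here is a scalar sketch in Python rather than MMX. The name `blend_channel` and the 0-255 range for both alphas are assumptions made for the example, not code from this thread.

```python
def blend_channel(src, dst, pixel_alpha, global_alpha=255):
    """Blend one 8-bit channel using per-pixel and whole-image alpha.

    All inputs are assumed to be integers in 0..255 (an assumption of this
    sketch). The two alphas are multiplied, so either one can fade the
    source out on its own.
    """
    # Combined weight in the range 0.0 .. 1.0.
    weight = (pixel_alpha / 255.0) * (global_alpha / 255.0)
    # Plain linear interpolation from dst towards src.
    return round(dst + (src - dst) * weight)
```

With `global_alpha` left at 255 this reduces to an ordinary per-pixel blend, which is why the combined version can reasonably live in a separate routine.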
Posted on 2001-12-31 14:38:00 by Eóin
I got some very interesting errors while coding this. I will be playing with this stuff on and off for some time. Maybe I'll make up a little test program that just assembles MMX instructions, so we can edit-test-edit-test real quickly.

Did you move the memory accesses to the end of your loop before or after you had seen my algo? :) Don't all those ()()()()()() bother you when you're playing with the register usage within the algo?

It's hard to spoil a New Year. :alright:

p.s. Also, don't forget that you're dividing by 256 and the alpha only has a range of 0-255 (i.e. 255/256 doesn't equal one, but it's close enough).
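To put a number on the divide-by-256 point, here is a small scalar comparison in Python. Both function names are made up for the example, and the shift-by-8 form is only assumed to match what the fast integer path does.

```python
def blend_div256(src, dst, alpha):
    """Blend with a shift by 8 (divide by 256), the cheap integer form."""
    return dst + (((src - dst) * alpha) >> 8)


def blend_div255(src, dst, alpha):
    """Blend with an exact divide by 255, rounded to nearest, for comparison."""
    return dst + round((src - dst) * alpha / 255)


if __name__ == "__main__":
    # Fully opaque white over black: the shift version falls one level short.
    print(blend_div256(255, 0, 255), blend_div255(255, 0, 255))
```

The error is at most one level per channel in this model, which is the "close enough" being described.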

p.s. Stereopsis is another good place to read from.
Posted on 2001-12-31 16:57:46 by bitRAKE
You got me, I moved them after seeing your algo :grin:. I was actually just comparing my code and yours when I noticed that; I made the same change and it sped things up a bit.

Unfortunately I then had to remove them from the working algo, as on the final pass through the loop it would read the quad before the first source pixel and crash.
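The over-read is easy to see if you list the offsets a loop of that shape touches. This is a Python sketch under an assumed loop structure (index counts down from the last quad, and the load for the next pass sits at the bottom of the loop); `quad_load_offsets` is a hypothetical helper, not part of anyone's posted code.

```python
def quad_load_offsets(count):
    """Return the byte offsets loaded from the source buffer, in order.

    Models a loop that loads the next quad at the bottom of each pass
    (an assumed structure): the index starts at count - 1, is decremented
    after each store, the next quad is loaded, and the loop repeats while
    the decremented index is non-negative.
    """
    offsets = []
    index = count - 1
    offsets.append(index * 8)      # load before entering the loop
    while True:
        # ... blend and store the quad at index * 8 here ...
        index -= 1
        offsets.append(index * 8)  # load for the next pass
        if index < 0:              # sign set: leave the loop
            break
    return offsets
```

The last entry is always -8: the quad just below the buffer is read and then thrown away, which is harmless unless that address happens to be unmapped.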

I too ran into a lot of strange errors, almost all to do with negative numbers. I'd then copy what I had into a small test app which could display intermediate results so I could see what was happening.

As for brackets, yeah they do bother me, but it never occurred to me to try and change them. Now I'm considering naming the registers mma, mmb, mmc, ... and including them in the syntax highlighting. But that then means I'd really need to change any source code back to the standard form if I'm going to post it. All in all it's probably more hassle than it's worth.

Anyway Happy New Year and thanks for the link. :)
Posted on 2001-12-31 18:49:42 by Eóin
Anyone want to see what prompted me to go to all this effort? I was writing a small version of Pong Hau K'i and I wanted nice smooth edges on the pieces, not pixelated ones.

So here's the game for anyone who wants it. There's no AI so you'll need to have two players. In fact there are no built-in rules other than that it only allows legal moves, so you'll have to take turns yourself and judge when someone wins.

For those that don't know the rules, they're very simple. One player takes red, the other is yellow, and you take it in turns to move one of your pieces into the free spot. Someone wins when the other player can't make a move because both their pieces are blocked. It's very simple, and was more or less a waste of time I suppose, except for the huge amount I learned thanks to bitRAKE. And that in itself justifies the effort.

But still once you see it you'll have to agree, those pieces are perfectly round. And bitRAKE, don't worry, I have greater uses for a blending algo than just this silly game.
Posted on 2001-12-31 19:49:29 by Eóin
Cool interface.

How do you pronounce Eóin?

I wonder if your blending algo is the fastest freely available on the web?
15552 ticks for 1000 quads on Athlon.

This is part of my common macro file:
mm0 EQU <MM0>
mm1 EQU <MM1>
mm2 EQU <MM2>
mm3 EQU <MM3>
mm4 EQU <MM4>
mm5 EQU <MM5>
mm6 EQU <MM6>
mm7 EQU <MM7>

st0 EQU <st(0)>
st1 EQU <st(1)>
st2 EQU <st(2)>
st3 EQU <st(3)>
st4 EQU <st(4)>
st5 EQU <st(5)>
st6 EQU <st(6)>
st7 EQU <st(7)>
Most any modern editor could be programmed to search and replace either way.
Posted on 2001-12-31 20:47:48 by bitRAKE
Eoin is pronounced exactly the same as "own". In fact the English spelling of the name is Owen.

Thanks for the compliments, they are very much appreciated. As for it being the fastest alpha blend, well I don't know. It went back up to 20000 when I added in the second overall blend. However I was forced to add it in as a pmullw and psrlw. I was hoping I could implement it without the shifts by taking the high word of the multiplication, but then I discovered that pmulhw only takes signed words.

Signed values seem to be an endless problem :grin: .
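For anyone following along, the signed/unsigned difference can be shown with a scalar model of a single word lane. This is a Python sketch; the helper names are invented for the example, and only the per-word behaviour of PMULHW (high 16 bits of the signed product) is taken from the instruction's definition.

```python
def to_signed16(word):
    """Interpret a 16-bit pattern (0..0xFFFF) as a two's-complement value."""
    return word - 0x10000 if word & 0x8000 else word


def pmulhw_lane(a, b):
    """High 16 bits of the signed 16x16 product, as PMULHW does per word."""
    product = to_signed16(a) * to_signed16(b)
    return (product >> 16) & 0xFFFF


def unsigned_high_lane(a, b):
    """High 16 bits of the unsigned 16x16 product, for comparison."""
    return ((a * b) >> 16) & 0xFFFF
```

The two agree as long as both inputs stay below 0x8000, which is why keeping the top bit clear is enough to make the signed instruction usable here.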
Posted on 2002-01-01 06:25:32 by Eóin
In that case you might find this faster, signed version interesting?
Oh, look a free register too. ;)
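TimeThis: ; 13056 ticks for 1000 quads on Athlon.

mov eax,Source
mov edx,Destination
mov ecx,Count ; number of pixels/2
dec ecx ; index of the last quad
pxor mm7,mm7 ; zero, for unpacking bytes to words
movq mm0,[eax+ecx*8] ; preload source quad
movq mm1,[edx+ecx*8] ; preload destination quad
@@: movq mm4,mm0
movq mm2,mm0
psrlw mm4,1 ; halve words so the alpha word is positive
movq mm3,mm1
movq mm5,mm4
punpcklbw mm0,mm7 ; source pixel 0 as words
punpcklbw mm1,mm7 ; destination pixel 0 as words
punpckhbw mm2,mm7 ; source pixel 1 as words
punpckhbw mm3,mm7 ; destination pixel 1 as words
psubsw mm0,mm1 ; source - destination
psubsw mm2,mm3
punpcklwd mm4,mm4
punpckhwd mm5,mm5
punpckhdq mm4,mm4 ; pixel 0 alpha word in all four words
punpckhdq mm5,mm5 ; pixel 1 alpha word in all four words
psllw mm0,1 ; double to compensate for the halved alpha
psllw mm2,1
pmulhw mm0,mm4 ; high word of the signed product
pmulhw mm2,mm5
paddsw mm0,mm1 ; add the destination back
paddsw mm2,mm3
packuswb mm0,mm2 ; saturate both pixels back to bytes
movq [edx+ecx*8],mm0
dec ecx
movq mm0,[eax+ecx*8] ; flags untouched; on the last pass ecx = -1, so this reads below the buffer
movq mm1,[edx+ecx*8]
jns @B ; loop while the dec result is non-negative
ret 12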
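A scalar model can help when checking intermediate values of a routine like this by hand. The Python below is one reading of the listing above for a single pixel, not a replacement for it; `blend_pixel_signed` is a made-up name, and pixels are given as four bytes in memory order, with byte 3 treated as alpha because the routine broadcasts the high word of each source pixel.

```python
def blend_pixel_signed(src, dst):
    """Scalar model of one pixel of the signed MMX blend.

    src and dst are 4-tuples of bytes in memory order; byte 3 of src is
    taken as alpha, since the routine broadcasts the high word of each
    source pixel.
    """
    # High word of the source pixel: alpha in the top byte, byte 2 below it.
    alpha_word = ((src[3] << 8) | src[2]) >> 1   # psrlw 1: always < 0x8000
    out = []
    for s, d in zip(src, dst):
        diff = (s - d) * 2                        # psubsw, then psllw 1
        scaled = (diff * alpha_word) >> 16        # pmulhw: signed high word
        value = d + scaled                        # paddsw
        out.append(max(0, min(255, value)))       # packuswb saturation
    return tuple(out)
```

In this model a fully opaque 200 over 100 comes out as 199, and the alpha byte is blended along with the colour bytes, so it is a model of the arithmetic rather than of an ideal blend.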
Posted on 2002-01-01 11:10:45 by bitRAKE
If I've said it once, I've said it a thousand times, where do you come up with this stuff, it's genius. :)

Below is the slight modification I made to allow the additional overall alpha control. mm6 need only be set up with the alpha value outside of the loop and you're off.

This allows for some lovely animated effects such as smoke and fire. It times at 18 ticks (was at 15 before I added this bit), surely you must now have written one of the fastest alpha blend algos out there, great work bitRAKE :alright: .

Posted on 2002-01-01 20:16:00 by Eóin
Thanks, I just hope everyone will use it - I hate slow software. :grin:
Nice addition - can't wait to see what you have planned for this!
Posted on 2002-01-01 22:40:24 by bitRAKE
pxor mm7,mm7
movq mm6,GlobalAlpha
; top bit clear to ensure positive number
psllw mm6,7
movq mm0,[eax+ecx*8]
movq mm1,[edx+ecx*8]
@@: movq mm4,mm0
movq mm2,mm0
; top bit clear to ensure positive number
psrlw mm4,1
movq mm3,mm1
movq mm5,mm4
punpcklbw mm0,mm7
punpcklbw mm1,mm7
punpckhbw mm2,mm7
punpckhbw mm3,mm7
psubsw mm0,mm1
psubsw mm2,mm3
punpcklwd mm4,mm4
punpckhwd mm5,mm5
punpckhdq mm4,mm4
punpckhdq mm5,mm5
psllw mm0,2
psllw mm2,2

pmulhw mm4,mm6 ; top two bits are clear
pmulhw mm5,mm6 ; ie 7F... * 7F... = 3FF...

pmulhw mm0,mm4
pmulhw mm2,mm5
; total scaling gives signed-words to below
paddsw mm0,mm1
paddsw mm2,mm3

packuswb mm0,mm2
movq [edx+ecx*8],mm0
dec ecx
movq mm0,[eax+ecx*8]
movq mm1,[edx+ecx*8]
jns @B
Please, let me know if this works?
It seems logical that the shifts aren't needed. ;)
Better accuracy, too.
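Here is a scalar Python model of the scaling chain in the listing above, for checking that the shifts really do cancel. It assumes each word of GlobalAlpha holds a value in 0..255 before the shift by 7; that representation is an assumption of this sketch, and `blend_pixel_global` is a hypothetical name. Pixels are four bytes in memory order with byte 3 as alpha.

```python
def blend_pixel_global(src, dst, global_alpha):
    """Scalar model of one pixel of the blend with an overall alpha.

    global_alpha is assumed to be 0..255 (an assumption of this sketch).
    """
    alpha_word = ((src[3] << 8) | src[2]) >> 1   # psrlw 1: top bit clear
    g = global_alpha << 7                         # psllw 7: top bit clear
    combined = (alpha_word * g) >> 16             # pmulhw: top two bits clear
    out = []
    for s, d in zip(src, dst):
        diff = (s - d) << 2                       # psubsw, then psllw 2
        scaled = (diff * combined) >> 16          # pmulhw: signed high word
        value = d + scaled                        # paddsw
        out.append(max(0, min(255, value)))       # packuswb saturation
    return tuple(out)
```

The combined factor works out to roughly pixel alpha times global alpha divided by 4, and the shift by 2 on the difference restores that factor of 4, so no right shift is needed after the multiply.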
Posted on 2002-01-02 03:47:56 by bitRAKE
Yes, it worked perfectly once I reformatted the way GlobalAlpha was represented in mm6. And it clocks at 15.5 on my laptop, it's really really fast code.

It's things like this that show why assembly is still needed. Could any compiler even come close to this? I doubt it very much.

I think I should knock together a nice tech demo to show off what can be achieved with this.
Posted on 2002-01-02 06:21:13 by Eóin
it's nice to see people optimize where there's actually something
to gain ;). Good work you two..
Posted on 2002-01-02 06:28:58 by f0dder
Use PSHUFW to expand alpha values and save another cycle+. :)
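For readers who haven't used PSHUFW, this is a Python model of what it does to the four words of an MMX register. The function is only an illustration of the instruction's word selection; how it would slot into the blend loop is one possible reading of the suggestion, not something spelled out in the thread.

```python
def pshufw(src_words, imm8):
    """Model PSHUFW: pick each of the four result words from src_words.

    src_words is a list of four 16-bit words, lowest first. Bits 2i+1:2i
    of imm8 give the index of the source word copied into result word i.
    """
    return [src_words[(imm8 >> (2 * i)) & 3] for i in range(4)]
```

With the halved source quad in one register, an immediate of 055h would broadcast word 1 (pixel 0's alpha word) and 0FFh would broadcast word 3 (pixel 1's), each in a single instruction instead of an unpack pair. Note that PSHUFW needs a CPU with the SSE or extended MMX instructions.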
Posted on 2002-06-05 00:28:06 by bitRAKE
Is anything happening with this code anymore?

Anyone working on a demo?
Posted on 2002-08-22 04:41:58 by Qweerdy
I ultimately used it in this library, which most recently I used for a simple internet chess program.
Posted on 2002-08-22 07:41:25 by Eóin
Thanks... looking at the date you posted that, I guess I must have missed it while I was on vacation :(

This lib looks very good :alright:
Posted on 2002-08-22 13:00:51 by Qweerdy
Thanks, one thing I'd like to add is DX support.

I came across a Life program, Life32. I'd love to implement the way it uses DX, as it allows for windowed mode.

It's not a perfect implementation, but if high frame rates in windowed mode were necessary I'd say it's good enough.

Overlays could be another solution, but my lack of DX experience blocks me on both these avenues.
Posted on 2002-08-22 16:50:51 by Eóin

AMD Athlon XP 1.53GHz
---------------------------
Minimum Time:
---------------------------
10912 ticks for 1000 quads.
---------------------------
OK
---------------------------


On my Athlon 1.4GHz I got exactly the same result???
Posted on 2002-08-23 19:25:57 by huh
That's not so strange, on your pc there's just less ticks per second :)
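A quick worked conversion, assuming the reported ticks are CPU clock cycles (the thread doesn't say how they were measured) and using a helper name invented for the example:

```python
def elapsed_microseconds(ticks, clock_hz):
    """Convert a cycle count to wall-clock microseconds at a given clock rate."""
    return ticks / clock_hz * 1e6


if __name__ == "__main__":
    # Same 10912 ticks on both machines, different elapsed time.
    print(elapsed_microseconds(10912, 1.53e9))  # about 7.1 microseconds
    print(elapsed_microseconds(10912, 1.40e9))  # about 7.8 microseconds
```

So an identical tick count on two chips of the same family just means the faster-clocked one finishes sooner.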
Posted on 2002-08-24 06:23:02 by Qweerdy
That's not so strange, on your pc there's just less ticks per second


True :stupid:
Posted on 2002-08-25 01:21:49 by huh