this article became kinda long... short version:
is there a way to do arithmetically correct parallel subtraction of bytes, ie X R G B are 4 bytes in a row in memory - i have to subtract certain values from each byte a lot of times in an arithmetically correct way... is there a way to do this without having to splice the RGB values out each single loop? like MMX quadword operations, just for bytes? :)
if not - can you tell me how to stretch the 4 bytes into 4 words to get MMX to work and how to operate (parallel subtract) with them, finally, how to get the resulting byte values out of it again?
thank you very much in advance!
detailed version:

hi there, while working on my latest prog, a picture manipulation program, i ran into the following problem:
for my newest filter, a sharpen operator, i need to do a lot of subtraction:

X1 X2 X3
X4 O X5
X6 X7 X8

my formula for the O(riginal) pixel's new value is:
O = O + 1/16(O-X1) + 1/16(O-X2) + ... + 1/16(O-X8)
which can be simplified to:
O = O + 1/16(O*8 - X1 - X2 - ... - X8)

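the *8 multiplication is a nice "shl 3" and /16 becomes a shift right by 4 (it has to be the arithmetic "sar 4" rather than "shr 4", because the term in brackets can be negative)... the problem is the many subtractions, because they have to be done for each colour separately :(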
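As a reference for checking any faster version, the simplified formula can be sketched for a single 8-bit channel in plain C. The function name `sharpen_channel` and the clamp to 0..255 are assumptions for illustration (the post does not say how out-of-range results are handled). The sketch uses a signed divide, which truncates toward zero; an arithmetic shift would round toward minus infinity instead, so the two can differ by 1 for negative differences.

```c
#include <stdint.h>

/* Sharpen one 8-bit channel: o is the centre value, x[] holds the
   8 neighbours X1..X8. Hypothetical helper, not from the thread. */
static uint8_t sharpen_channel(uint8_t o, const uint8_t x[8])
{
    int diff = o * 8;                 /* O*8, at most 2040 */
    for (int i = 0; i < 8; i++)
        diff -= x[i];                 /* final range: -2040 .. 2040 */

    int v = o + diff / 16;            /* signed divide, truncates toward zero */
    if (v < 0)   v = 0;               /* assumed: clamp instead of wrapping */
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```

Because the intermediate value can be negative and can exceed 255, it needs a signed type wider than a byte, which is the same constraint the parallel version has to respect.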

RGB are stored in memory as consecutive bytes. right now, i have to extract and manipulate each colour separately, which costs a lot of time and anger ;)
this looks like this:

mov eax,dword ptr
mov Blue,eax
and Blue,0FFh        ; blue  = bits 0..7
mov Green,eax
and Green,0FF00h
shr Green,8          ; green = bits 8..15
mov Red,eax
and Red,0FF0000h
shr Red,16           ; red   = bits 16..23

... now i have BlueTemp, GreenTemp and RedTemp variables, initialized with 8 * the original RGB values, which i subtract the neighbour values from. this has to be done a whole lotta 8(!!!) times for each single pixel :/ splice original RGB, multiply by 8 and store in the Temp variables, then 8 times the same: splice RGB, subtract from the Temp RGB values, and recombine RGB after all 8 pixels.
this sucks and i want to do the subtraction in parallel, which would save a whole lotta time spent on splicing and single subtractions.

the problem is that 8*original value is in the worst case 8*255 = 2040, which means a byte is too small to store it :/ so what i would really need is a function that can do parallel subtraction of 3 BYTE values from 3 WORD values, all in memory :)
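One way to picture "bytes subtracted from words in parallel" is to spread the four bytes of a pixel into four 16-bit lanes of a 64-bit integer and subtract whole pixels at once. This is a portable C sketch of the idea, not MMX code. It mirrors the usual MMX flow (punpcklbw against a zero register to widen, psubw to subtract packed words, packuswb to pack back to bytes with saturation). The names `widen` and `sharpen_pixel`, the bias trick and the final clamp are assumptions for illustration. A bias of 2040 is added to every lane so that no lane can go negative and borrow from its neighbour.

```c
#include <stdint.h>

/* Spread the 4 bytes of an XRGB dword into four 16-bit lanes:
   0xAABBCCDD -> 0x00AA00BB00CC00DD */
static uint64_t widen(uint32_t p)
{
    uint64_t v = p;
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFull;
    v = (v | (v << 8))  & 0x00FF00FF00FF00FFull;
    return v;
}

/* o is the centre pixel, x[] the 8 neighbours, all as XRGB dwords. */
static uint32_t sharpen_pixel(uint32_t o, const uint32_t x[8])
{
    /* 2040 (0x7F8) in every lane: the largest amount that can be subtracted */
    const uint64_t bias = 0x07F807F807F807F8ull;

    uint64_t acc = widen(o) * 8 + bias;   /* each lane: 8*O + 2040 <= 4080 */
    for (int i = 0; i < 8; i++)
        acc -= widen(x[i]);               /* four subtractions per operation */

    uint32_t out = 0;
    for (int k = 0; k < 4; k++) {
        int diff = (int)((acc >> (16 * k)) & 0xFFFF) - 2040; /* remove bias */
        int c    = (int)((o >> (8 * k)) & 0xFF);
        int v    = c + diff / 16;         /* signed divide, truncates toward zero */
        if (v < 0)   v = 0;               /* assumed: saturate like packuswb */
        if (v > 255) v = 255;
        out |= (uint32_t)v << (8 * k);
    }
    return out;
}
```

The unused X byte goes through the same arithmetic as the colour channels, which is harmless and avoids masking it out. With real MMX the signed 16-bit lanes of psubw make the bias unnecessary; it is only needed here because a plain integer subtraction lets borrows cross lane boundaries.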

can you point me in the right direction? is the answer MMX? how could i make use of it concerning my problem?
have you any ideas how i could make it better?

thank you really a lot, have a nice day,
Posted on 2002-11-09 17:42:18 by BugByter
MMX would certainly be faster. More important is to think of the problem in parallel: imagine each pixel with the desired equation and break the problem into smaller parallel passes. You don't want the algo to do each pixel by itself, but to process the image in strips. This will maximize cache usage and minimize memory accesses.
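One possible reading of this advice, sketched in plain C for a single 8-bit channel: since O*8 - X1 - ... - X8 equals 9*O minus the sum of the whole 3x3 block, the work can be split into a horizontal pass that sums three neighbouring pixels per row, and a second pass that combines three row sums. Each source row is then read once for its sums and the rows stay hot in the cache. All names here (`row_sum3`, `sharpen_row`, `sharpen_image`), the unchanged border pixels and the clamp are assumptions for illustration, not the VirtualDub code.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Pass 1: h[x] = src[x-1] + src[x] + src[x+1] for the interior of a row. */
static void row_sum3(const uint8_t *src, uint16_t *h, size_t w)
{
    for (size_t x = 1; x + 1 < w; x++)
        h[x] = (uint16_t)(src[x - 1] + src[x] + src[x + 1]);
}

/* Pass 2: combine the sums of the rows above, at and below the current row. */
static void sharpen_row(const uint8_t *mid, const uint16_t *h0,
                        const uint16_t *h1, const uint16_t *h2,
                        uint8_t *dst, size_t w)
{
    for (size_t x = 1; x + 1 < w; x++) {
        int box  = h0[x] + h1[x] + h2[x];   /* 3x3 sum including the centre */
        int diff = 9 * mid[x] - box;        /* same as O*8 - X1 - ... - X8 */
        int v    = mid[x] + diff / 16;
        if (v < 0)   v = 0;
        if (v > 255) v = 255;
        dst[x] = (uint8_t)v;
    }
}

/* Sharpen a w*h single-channel image; border pixels are copied unchanged.
   Returns 0 on success, -1 if the row buffers cannot be allocated. */
static int sharpen_image(const uint8_t *src, uint8_t *dst, size_t w, size_t h)
{
    memcpy(dst, src, w * h);
    if (w < 3 || h < 3)
        return 0;

    uint16_t *buf = malloc(3 * w * sizeof *buf);
    if (!buf)
        return -1;
    uint16_t *r[3] = { buf, buf + w, buf + 2 * w };

    row_sum3(src, r[0], w);
    row_sum3(src + w, r[1], w);
    for (size_t y = 1; y + 1 < h; y++) {
        row_sum3(src + (y + 1) * w, r[2], w);   /* only the new row is summed */
        sharpen_row(src + y * w, r[0], r[1], r[2], dst + y * w, w);
        uint16_t *t = r[0]; r[0] = r[1]; r[1] = r[2]; r[2] = t;  /* rotate */
    }
    free(buf);
    return 0;
}
```

Both inner loops walk memory linearly with no per-pixel branching on position, which is the shape that maps most directly onto packed-word instructions.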

Look at a_sharpen.asm in:

...for examples MMX and non-MMX. :)
Posted on 2002-11-10 00:15:46 by bitRAKE
Whoa! VirtualDub is open source!? Thanks for the link bitRAKE, this rocks!
Posted on 2002-11-10 03:39:03 by Qweerdy
thanks a lot bitrake for the idea with virtualdub, i was too dumb anyway, have to do it myself or else it would take too long to get it ;)
found really good mmx info here (easy mmx primer):
"art of assembly" mmx with nice >>> PICTURES <<< of packing/unpacking instructions (but spelling error at shift right instr):
how to detect if mmx is provided:
big black tutorial, no pics:

btw browsed through - really nice collection of very useful texts!

thanks again for your help, after one more day of work it now does real-time sharpening, approx (guessed) twice as fast as before, even on really large pics :)
Posted on 2002-11-10 09:50:21 by BugByter