HI, I'm learning asm language to add some inline _asm instruction of MMX or SSE to accelerate image processing.After some segment on the original image, I can get the raster data like the pic below. the "1" stands for the object area.


my object is to determine the center (called centroid of the object on image) like


some times the image is very large , and I should use the algorithms like:
for example : the image size is Xsize * Ysize

I'd to define some Array like x ,y

then I could walk though all the pixel line by line. If I found a "1" I should add x and y, i,j is the corresponding coordinates of the pixel. than I could get the distribution like. on each x,y axis.


Then, I can locate the center of the object.

But these algorithms is very slow if I use the C or C++ language ,which in my project, the image processing is very time critical. that means ,I should processing a 1000*1000 image in less than 10ms.

so, I think The MMX or SSE instruction may help me.
can some one give me suggestion on this. Thanks.

by the way, This image size is too large to fit the post, is there some html scrips to make it smaller?
Posted on 2008-01-02 21:34:53 by z23
Caching is even more important here. Because only RAM bandwidth is the problem here.
The trick will be to process 16 columns at once. Or as many as possible. Integer-handling SSE (was it SSE2 or SSE3 I don't remember right now) is the way to go.
Make sure that you only read from memory and update only register-values.
Also, try to do some prefetching.
If the image-width is constant, it'll help you even more. If not, then make SMC (self-modifiable code) - you'll only need to patch-up dword integer-values in instructions like "movaps xmm0," (to change the value 1000 to the necessary number).

I'm hinting to avoid using memory-writes in your loop, because it'll get in the way of burst-reading from DDR. Even if only a little.
Maybe you'll also need to use instructions, that avoid the cache.

But you can use the cache for optimal speed in another way: process the image in 64x64px regions, returning two 64-element arrays: the horizontal sums and the vertical sums. With such cached data, you can get away with MMX or even plain old 486 code.
Or process in 64x32, and do double-buffering (process one tile while prefetching the other).

The first thing you really need to do is grab the whitepapers of the CPUs you will target. - what's the size of L1 cache, what prefetch/cache/non-cache instructions they have, what SSE version they have, how long are the cache-miss penalties, how long are the cache-lines and memory-bus width. And the type of memory they use (SDRAM,DDR,DDR2) - what are the features/clockspeeds/burst-read lengths.

I can bet that the code can be done to process 1000x1000 in 3ms on a 2GHz cpu with DDR@400MHz


P.S. Hmm, I just realized... your table is comprised of bytes, not dwords - right? This will make things even easier :D. Only MMX can handle bytes in parallel, and quite well at that. You only need to first upconvert unsigned bytes to unsigned dwords inside the MMX registers. Though, write-backs to RAM probably can't be avoided. So, I suggest processing the image in 128x128px tiles (or 125x125px), first summing-up the rows - then summing-up the columns in that tile. (because you won't have enough spare registers to put vertical sums in, while downloading the tile to L1 cache).

P.P.S. Don't use inline asm, do yourself a favor and put the asm code in a separate file.
Posted on 2008-01-02 22:30:12 by Ultrano
thanks Ultrano.

my computer is P43.0E and has 256*2 DDR400 Ram.
I have read some stuff about the SSE2 instruction supplied by the processor. With the 128bit register of SSE2, I can use 16bytes operation pararrelly.

Caching is even more important here. Because only RAM bandwidth is the problem here.
The trick will be to process 16 columns at once. Or as many as possible. Integer-handling SSE (was it SSE2 or SSE3 I don't remember right now) is the way to go.
Make sure that you only read from memory and update only register-values.
Also, try to do some prefetching.
If the image-width is constant, it'll help you even more. If not, then make SMC (self-modifiable code) - you'll only need to patch-up dword integer-values in instructions like "movaps xmm0," (to change the value 1000 to the necessary number).

through I have leaned asm programming with MASM32 packed. But I'm not familiar with the "Caching" programming. Is it hard to learn?

P.S. Hmm, I just realized... your table is comprised of bytes, not dwords - right? This will make things even easier :D. Only MMX can handle bytes in parallel, and quite well at that. You only need to first upconvert unsigned bytes to unsigned dwords inside the MMX registers. Though, write-backs to RAM probably can't be avoided. So, I suggest processing the image in 128x128px tiles (or 125x125px), first summing-up the rows - then summing-up the columns in that tile. (because you won't have enough spare registers to put vertical sums in, while downloading the tile to L1 cache).

SSE2 can also handle bytes in Parrallel, My image is some homochromy,which I have already translate the RGB  to it's luminance i=(R+G+B)/3 , so the raster data is one byte/pixel.

But I found that the SSE2 instruction need memory aligned 16bytes, so I should allocate some data buffer aligned by 16 bytes. and load it to XMM[0]...XMM[7], each XMM register can hold 16 bytes. so ,I can only load 16*8 pixels once a time.

But I'm not sure the SSE2 instruction hase some functionality to sum the horizontal array and vertical arrays. Sometimes I feel it's so hard to do this. I even wander I should use some libarys like Intel IPP and OpenCV.

P.P.S. Don't use inline asm, do yourself a favor and put the asm code in a separate file.

Ok, I'd like to do that, the inline asm is very restricted on compilers, now, I should write them in ASM files.





Posted on 2008-01-02 23:19:20 by z23
One small note about Luminance - depending on what kind of image processing you are doing, it might pay to find out what 'gamma correction' is about.
Posted on 2008-01-02 23:27:11 by Homer
The original format of the image data is some thing called Bayer image.
which can find it here:
http://en.wikipedia.org./wiki/Bayer_filter

than, I'd do some algorithm to change the Bayer image(one bytes per pixel) to RGB image(3 bytes per pixel).

then ,I have to change the RGB image to luminance using the formular I=(R+G+B)/3;

then I'd do some threshold method on the I data to segment the object area from the enviroment.

All these steps need a lot of CPU time.
Posted on 2008-01-02 23:42:40 by z23
Hmm, your original image is 24-bit RGB. With that conversion you're doing beforehand, you're severely killing performance. Now, is the input data 24-bit or 32-bit (padded RGB) ?

The best action is to be doing the 24->8 conversion on the fly while summing-up data. You won't divide by 3 or use any tricks, instead a pixel will have values of 0..765.
s
"Caching" programming...hum - it's simply about knowing how the cache works and what its size and layout are. Then sequencing your instructions in such a way, that your code is constantly fed with data, without being stopped by penalties. It's much more simple than it sounds. Just study the whitepapers of your cpu's caching, the ideas and knowledge will come immediately.

8 16-byte XMM registers won't hold 128 pixels, you do need a good number of the XMM registers to do data-unpacking and preprocessing.

Bottom line: Process the stuff in tiles (calculate the size to fit nicely in half of L1), do double-buffering, first calculate sums horizontally (this will put data in L1 cache easily), use SSE2 for it all (you need the 128-bit MMX instructions of SSE2) .

It's a pity you target P4. This could all be done at top speed with general-purpose registers on AMD cpus :)
Posted on 2008-01-03 00:03:42 by Ultrano
You might consider using a matrix to transform your pixel data.
Its not an immediately obvious solution.. we can represent any series of LINEAR operations (we are performing apon each pixel) using just one matrix, and then we can apply that matrix to each input pixel and dramatically speed up the processing time with an error that is so small it usually won't be a problem.

Of course, its only worth considering if the total cost of your linear operations (per pixel) is heavy-duty, ie, more than the cost of doing all the multiplying to make your combined matrix plus the cost of applying it to each pixel... but that can often be the case when we are processing images.

See 'Convolution Filter' for a clearer idea, but if you already understand what matrices are about, there's lots of room for innovation here.
Posted on 2008-01-03 01:07:34 by Homer
This might be "ob duh", but use VirtualAlloc to allocate your buffers to ensure you have the data alignment you need. And of course, allocate only at program start if you can get away with it, instead of constant alloc/free.
Posted on 2008-01-03 04:39:04 by f0dder
Maybe I don't know enough, but:
- proper gamma correction will involve a 768-entry LUT of floats. And making the whole thing sum-up floats. For just a few percent better results in the centroid-position.
- convolution ... seems quite unnecessary here. Besides, convolution on 1000x1000  in 10ms on a P4...
- if he has the option to manually define/allocate the buffer, usually 8+ million cycles are wasted beforehand - I hope that is not the case. Anyway 24-bit RGB is far from being comfortable for SSE, any alignment problem is much easier to solve, compared to properly and quickly unpacking+handling this data. And if truly the image is 1000x1000, you'll anyway have to handle unaligned data once on every scanline.

z23, by any chance is this a project to do Wiimote-like position detection of infrared light behind an infrared-pass filter (that makes the background black), with only one emitter tracked instead of four?
What exactly is the format of the pixels, RGB 24bpp, BRGB 32bpp or XRGB 32bpp?
Posted on 2008-01-03 09:20:19 by Ultrano
The image I will analysis is a one byte per piexel. each byte stands for the Lightness from 0-255. 0=black, 255= white.
for example,
***step 1  a camera has transfer a image of 100*100, which occupies 100*100 bytes. (This is called Bayer image, you can find more detail in wiki)
***step 2  I will create a 100*300 bytes buffer, then translate the bayer image to RGB image by some function provided by the camara company.
***step 3 I will translate the RGB image to HSL(hue,saturation,lightness) image, because I only use the ligntness part, so it came out the 100*100 bytes contian each pixel's lightness.
***step 4 I will do some image processing on the image generated by step 3, and get the result showed in my first post.
***step 5, This is the biggest problem, how to identify the center of the object.


Posted on 2008-01-03 19:29:18 by z23