Hi! I like using the Windows API, but I'm fascinated by fast asm algorithms.
As an exercise I'm trying to write code for the following problem.
I've got a table of bytes, that is, an array of 'i_rows' rows and 1024 columns.
I want to find the maximum value within each column and store it. So the outcome would be
an array of 1024 "max value" bytes where the i-th element is the max value within the i-th column of the table.
My poor asm knowledge would suggest using CMP and jump instructions. Or are there any special Pentium registers to use?
Code snippets are welcome.

Thanks.
:rolleyes:
Posted on 2004-03-19 03:24:06 by _OuzO_
Hi

Say data source is at ...

It should be something like this:

init:
movq mm0,
mov ebx,esi

mov eax,
imul ecx,eax,1024
add ecx,ebx

myloop:
movq mm1,
pmaxub mm0,mm1
add ebx,1024
cmp ebx,ecx
jb myloop ; unsigned compare, since ebx and ecx hold addresses


This sample should find the max of the first 8 columns. Understand it, then wrap it in another loop to get the other 1016 :) Ollydbg is great for learning. Optimizing further: align the loop. Optimizing in the real world: the data should be reorganized by swapping rows and columns, so that each column's bytes are contiguous and the cache is used well.
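For anyone who wants to check the logic before going to asm, here is a rough C sketch of the same idea with the outer loop added. The names `column_max`, `max8`, `COLS` and `STRIP` are made up for illustration, and `max8` only models what one pmaxub does on 8 bytes; it is not the MMX code itself.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { COLS = 1024, STRIP = 8 };  /* 8 bytes = what one mm register holds */

/* Hypothetical helper: per-byte unsigned max of two 8-byte groups,
   i.e. the effect of a single pmaxub on two mm registers. */
static void max8(uint8_t *acc, const uint8_t *row)
{
    for (int k = 0; k < STRIP; k++)
        if (row[k] > acc[k])
            acc[k] = row[k];
}

/* table: rows * COLS bytes stored row by row; out: COLS bytes.
   Assumes rows >= 1. */
void column_max(const uint8_t *table, size_t rows, uint8_t *out)
{
    /* Outer loop: 1024 / 8 = 128 strips of 8 columns each. */
    for (size_t strip = 0; strip < COLS; strip += STRIP) {
        uint8_t acc[STRIP];
        memcpy(acc, table + strip, STRIP);        /* start from row 0 */
        /* Inner loop: step down the rows, COLS bytes at a time. */
        for (size_t r = 1; r < rows; r++)
            max8(acc, table + r * COLS + strip);
        memcpy(out + strip, acc, STRIP);
    }
}
```

Call it as `column_max(table, i_rows, result)` with a 1024-byte `result` buffer. The inner loop strides through memory 1024 bytes at a time, which is the cache problem mentioned above.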
Posted on 2004-03-19 07:05:13 by valy
What about computing multiple column max values in parallel inside the loop? This should be possible at least on the P4, since the cache/prefetch unit can handle multiple data streams, right?
Posted on 2004-03-19 07:56:57 by f0dder
Good!

Q1 - Do you know where I can get a quick & easy tutorial about these mm0, mm1 regs and their related instructions?

Q2 - If the P5 has further regs like mm1, one could use them to cover 16 columns per loop iteration, i.e.

init:
movq mm0,
movq mm2, ; this is added
mov ebx,esi

mov eax,
imul ecx,eax,1024
add ecx,ebx

myloop:
movq mm1,
pmaxub mm0,mm1

movq mm3, , ; this is added
pmaxub mm2,mm3 ; this is added

add ebx,1024
cmp ebx,ecx
jb myloop ; unsigned compare, since ebx and ecx hold addresses

or is that just fantasy? :alright:
Posted on 2004-03-19 11:23:34 by _OuzO_
mm0 and friends are from the MMX instruction set, included with the Pentium MMX processor (i.e., not the plain P5). It's present in more or less every processor today, and I wouldn't really care about targeting anything older than the Pentium MMX. I don't know if there are any good & easy tutorials about MMX; the Intel specification is somewhat dry. Unless somebody has some good text (which I'd like to see, too :)), try googling for "mmx introduction" or "mmx tutorial" or something like it.

There are eight MMX registers, mm0 through mm7. They are aliased onto the floating-point register stack, so do not mix MMX and floating-point code, and remember to issue "emms" after you're done with a block of MMX code.

MMX is designed for "Single Instruction, Multiple Data" (SIMD) on integer quantities: things like adding 8 bytes, 4 words or 2 dwords with one instruction.
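To make "adding 8 bytes with one instruction" concrete, here is a small C model of what a packed byte add (paddb) does to one 64-bit register. The function name `add_8_bytes` is invented for this sketch; the real instruction does all eight lanes at once, the loop here is only for illustration.

```c
#include <stdint.h>

/* Models a packed byte add on one 64-bit value: eight independent
   byte additions, each wrapping modulo 256, with no carry passing
   from one byte lane into the next. */
uint64_t add_8_bytes(uint64_t a, uint64_t b)
{
    uint64_t result = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t x = (uint8_t)(a >> (8 * lane));   /* byte 'lane' of a */
        uint8_t y = (uint8_t)(b >> (8 * lane));   /* byte 'lane' of b */
        uint8_t sum = (uint8_t)(x + y);           /* wraps at 256 */
        result |= (uint64_t)sum << (8 * lane);
    }
    return result;
}
```

The difference from an ordinary 64-bit add shows up when a lane overflows: the low byte 0xFF plus 0x01 becomes 0x00 and nothing is carried into the neighbouring byte.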

The Pentium III (I think) also added SSE, Streaming SIMD Extensions, which are SIMD instructions for floating-point values. The Pentium 4 introduced SSE2, which adds more SSE instructions and extends the regular MMX integer instructions to work on 128-bit quantities instead of the 64 bits of the old MMX. The SSE/SSE2 registers are also not aliased on the floating-point stack.
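As a sketch of what the 128-bit version buys you, the column-max problem from this thread could look roughly like this with the SSE2 compiler intrinsics from `<emmintrin.h>`, 16 columns per step. The names `column_max_sse2` and `TABLE_COLS` are assumptions for the example, the unaligned loads are chosen for safety rather than speed, and a plain C fallback is included for compilers that do not define `__SSE2__`.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

enum { TABLE_COLS = 1024 };

/* table: rows * TABLE_COLS bytes stored row by row; out: TABLE_COLS bytes.
   Assumes rows >= 1. */
void column_max_sse2(const uint8_t *table, size_t rows, uint8_t *out)
{
#if defined(__SSE2__)
    /* 16 columns per xmm register, so 1024 / 16 = 64 strips. */
    for (size_t c = 0; c < TABLE_COLS; c += 16) {
        __m128i acc = _mm_loadu_si128((const __m128i *)(table + c));
        for (size_t r = 1; r < rows; r++) {
            __m128i v = _mm_loadu_si128(
                (const __m128i *)(table + r * TABLE_COLS + c));
            acc = _mm_max_epu8(acc, v);   /* pmaxub on xmm registers */
        }
        _mm_storeu_si128((__m128i *)(out + c), acc);
    }
#else
    /* Fallback without SSE2: one column at a time. */
    for (size_t c = 0; c < TABLE_COLS; c++) {
        uint8_t m = table[c];
        for (size_t r = 1; r < rows; r++)
            if (table[r * TABLE_COLS + c] > m)
                m = table[r * TABLE_COLS + c];
        out[c] = m;
    }
#endif
}
```

No emms is needed here, since the xmm registers are separate from the floating-point stack.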
Posted on 2004-03-19 11:34:51 by f0dder
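pmaxub is not part of the original MMX set. If I remember correctly, the mm-register form came in with SSE (Pentium III; the Athlon has it too through AMD's MMX extensions), and SSE2 added the 128-bit xmm form. So far I have yet to see a tutorial for SSE or SSE2, though I have seen one for MMX. So the code will only work on processors that have at least SSE, for example the P3 or P4. Yes, the code has higher requirements than plain MMX...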
Posted on 2004-03-19 21:46:54 by roticv
Jeremy has some information on using SSE/SSE2 instructions on his website, near the bottom of the page under Advanced Tutorials and Sample Code.
Posted on 2004-03-19 22:09:38 by donkey
Btw, use shl instead of imul when multiplying by a power of two (1024 = 2^10, so shift left by 10).
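A quick way to convince yourself the two are equivalent for the table size is the C comparison below. Both function names are hypothetical, and a modern compiler will most likely turn the multiply into a shift on its own, so this mostly matters for hand-written asm.

```c
#include <stdint.h>

/* Size of the table in bytes, written as a multiply (the imul form). */
uint32_t table_bytes_mul(uint32_t rows)
{
    return rows * 1024u;
}

/* Same value with a shift (the shl form): 1024 = 2^10. */
uint32_t table_bytes_shift(uint32_t rows)
{
    return rows << 10;
}
```

Both wrap the same way on 32-bit overflow, so the results agree for every input.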
Posted on 2004-03-20 04:09:40 by f0dder
Yes, Jeremy's site looks good for the purpose. For now I also found this on the web:

http://www.cs.wpi.edu/~matt/courses/cs563/talks/powwie/p3/mmx.htm

Thank you all guys,
_OuzO_
:alright:
Posted on 2004-03-22 06:33:08 by _OuzO_