byte mem[1000000] = {12,8,7,1,100,200,31,41,...................};
byte index[256] = {5,2,10,50,1,4,7,9,111,........};

for (i = 0; i < 1000000; i++)
    mem[i] = index[mem[i]];
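
For anyone who wants a baseline to time the asm versions against, here is a minimal, self-contained C sketch of the same loop. The function name translate_bytes and its parameters are made up for illustration, and it assumes a 256-entry table so that every possible byte value is a valid index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace every byte of buf with its entry in a 256-entry lookup table.
   Hypothetical helper: the name and signature are not from the thread. */
void translate_bytes(uint8_t *buf, size_t len, const uint8_t table[256])
{
    assert(buf != NULL && table != NULL);
    for (size_t i = 0; i < len; i++)
        buf[i] = table[buf[i]];   /* one load, one table lookup, one store per byte */
}
```

Any faster variant should produce exactly the same buffer contents as this one, which makes it handy as a correctness check.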

How can I optimize this with asm?
(I know I should use xlat to optimize it. Is there any other optimization with asm using MMX?)
Thanks a lot.
Posted on 2004-03-18 20:18:52 by kerrylau
Don't use XLAT.
(unless optimizing for size)

Unroll instead.

No MMX here.
(maybe non-temporal stores if dest <> source)
Posted on 2004-03-18 21:01:38 by bitRAKE
What about prefetching the index table? Might give some improvement?
Posted on 2004-03-19 07:31:38 by f0dder
Sorry, can you give me a detailed solution?
Posted on 2004-03-19 11:31:11 by kerrylau
next:

movzx eax, BYTE PTR [esi+3]
mov bl, [edi+eax]

movzx eax, BYTE PTR [esi+2]
mov bh, [edi+eax]

bswap ebx

movzx eax, BYTE PTR [esi+1]
mov bh, [edi+eax]

movzx eax, BYTE PTR [esi+0]
mov bl, [edi+eax]

mov [esi], ebx
add esi, 4
Unroll further to remove all dependencies.
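
For readers who find C easier to follow, the idea in the listing above (four table lookups collected in one register, then a single 32-bit store) can be sketched roughly as follows. This is only an approximation of the technique, not a translation of the asm: the name translate_bytes_x4 is made up, the packing order assumes a little-endian host such as x86, and a compiler may well generate different instructions than the hand-written version.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Translate buf in place, four bytes per iteration.
   Hypothetical helper; assumes a little-endian host (e.g. x86). */
void translate_bytes_x4(uint8_t *buf, size_t len, const uint8_t table[256])
{
    assert(buf != NULL && table != NULL);
    size_t i = 0;
    for (; i + 4 <= len; i += 4) {
        /* Four independent lookups, combined into one 32-bit value.
           The lowest byte of packed ends up at the lowest address. */
        uint32_t packed = (uint32_t)table[buf[i]]
                        | (uint32_t)table[buf[i + 1]] << 8
                        | (uint32_t)table[buf[i + 2]] << 16
                        | (uint32_t)table[buf[i + 3]] << 24;
        memcpy(buf + i, &packed, sizeof packed);  /* single 4-byte store */
    }
    for (; i < len; i++)          /* leftover bytes when len is not a multiple of 4 */
        buf[i] = table[buf[i]];
}
```

The shifts play the role that BSWAP and the bl/bh writes play in the asm; which form is faster depends on the target processor, so it is worth timing both.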
Posted on 2004-03-20 08:32:13 by bitRAKE
bitRAKE:

I'm just curious, I've just started studying how to optimize for x86 processors and... doesn't this code introduce lots of "partial stalls" or whatever it is called, because of writing to partial registers and modifying/reading it back afterwards?

I have lots to learn yet!!!
Posted on 2004-03-24 09:31:19 by persil
persil, should not be a problem if unrolled to 16 bytes per loop. Use BSWAP or a shift instruction depending on what works best on your target processor. I was just putting some code down for us to talk about as kerrylau seemed too timid to post his work.
Posted on 2004-03-25 06:16:22 by bitRAKE
ok, I didn't want to imply anything, just that I thought it wasn't a good thing to work with parts of registers because then the CPU had to physically write the register before being able to read it back... I just wanted to know if what I'm thinking is right!!! No offense or anything :(
Posted on 2004-03-25 18:14:55 by persil
persil, no offense taken. :) I meant to imply you are correct in your understanding - oh, I say too little sometimes, sorry. The delay penalty is processor specific in my experience, but there will be a delay. I used MOVZX to reduce this problem as much as possible. Some people might align the data structure and try something like:


next:
mov al, BYTE PTR [esi+3]
mov bl, [eax]

mov al, BYTE PTR [esi+2]
mov bh, [eax]

bswap ebx

mov al, BYTE PTR [esi+1]
mov bh, [eax]

mov al, BYTE PTR [esi+0]
mov bl, [eax]

mov [esi], ebx
add esi, 4
...give it a try, but I am quite certain this is slower. (I mean to imply that this code should be unrolled and compared to the above code unrolled.)
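
As I read it, the second listing relies on the table sitting on a 256-byte boundary, with the upper 24 bits of eax holding its address, so that loading a byte into al yields the address of the table entry directly. Here is a rough C sketch of that addressing trick. The names make_aligned_table and translate_bytes_aligned are hypothetical, aligned_alloc needs a C11 library, and turning an integer back into a pointer like this is implementation-defined, so treat it as an illustration for flat-memory targets only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy a 256-entry table into storage aligned on a 256-byte boundary.
   Returns NULL on allocation failure; release the result with free(). */
uint8_t *make_aligned_table(const uint8_t table[256])
{
    uint8_t *aligned = aligned_alloc(256, 256);
    if (aligned != NULL)
        memcpy(aligned, table, 256);
    return aligned;
}

/* Translate buf in place using a table whose address has its low 8 bits clear. */
void translate_bytes_aligned(uint8_t *buf, size_t len, const uint8_t *aligned_table)
{
    uintptr_t base = (uintptr_t)aligned_table;
    assert((base & 0xFF) == 0);   /* the trick only works on a 256-byte boundary */
    for (size_t i = 0; i < len; i++) {
        /* OR-ing the data byte into the low 8 bits mirrors "mov al, [esi+n]". */
        const uint8_t *entry = (const uint8_t *)(base | buf[i]);
        buf[i] = *entry;
    }
}
```

In C this gains nothing over table[buf[i]]; the point is only to show why the alignment matters for the asm version.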
Posted on 2004-03-25 19:45:33 by bitRAKE
Yeah, ok thanks :) I believe you 100%...

In fact, as for the last code you wrote, which you say would be slow, I guess it would indeed be excruciatingly slow on a P4, if it relates to a stupid mistake I made a couple of days ago :)

The thing I'm still not sure about is this: I know there is a penalty for calculating an address from a register which has just been modified, I'm sure of that. I think it's an AGI??

And I know that there's also a penalty for using a 32-bit register right after writing only a partial value (8 or 16 bits) into it. So, in the code, would mov [esi], ebx suffer from this penalty? Is it not as bad as an AGI?
Posted on 2004-03-25 19:56:31 by persil