Hi all,

I'm calculating some indices into an array using SIMD (MMX/SSE2). The C++ code looks like this:

int d;

d = (a - b)<<16;
d = (c + g)>>12;
.
.
.
return h + coef;

The asm code calculates "d" using SIMD, so I end up with 4, 8 or 16 "d" values in MMX or XMM registers. (coef holds values which are precalculated for "d".)

It's inline asm:

_asm{

.
.
.
// calculating "d"


movd eax,mm2 //fixme AMD penalty 5-10 cycles
movd ecx,mm3 //fixme AMD penalty 5-10 cycles

punpckhdq mm2,mm2 //HD into LD
punpckhdq mm3,mm3 //HD into LD

movd ebx,mm2 //fixme AMD penalty 5-10 cycles
movd edx,mm3 //fixme AMD penalty 5-10 cycles

movd mm6,
movd mm5,

punpckldq mm6,
punpckldq mm5,

// ---> moving memory values based on "d" into MMX regs

paddd mm6,mm0
paddd mm5,mm1

.
.
.
}

The problem is that you get a huge penalty using "movd ebx,mm2" on AMD CPUs, and even if I go through a temporary __m64 address the code is slow.
Does anyone know a better way to calculate an index, access those memory addresses, and get the values back into an MMX register? Maybe using an array of pointers or something similar?

Or does anyone know a way to load/store the values without huge penalties and store-forwarding stalls?
Posted on 2004-06-05 06:18:34 by Andy2222
1) Use the ALU instead of MMX/SSE.
2) Test whether going through memory is faster than eax on AMD.
3) Buy an Intel CPU and ignore broken clones.
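
For what it's worth, suggestion 2 (going through memory instead of movd to a general register) could look roughly like this with SSE2 intrinsics. It is only a sketch: `coef` here is a made-up 8-entry table, and `add_coef` and `COEF_SIZE` are hypothetical names, not anything from the original code. Whether it actually beats the movd version on a given CPU has to be measured.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the thread's precalculated "coef" table.
static const int COEF_SIZE = 8;
static const int32_t coef[COEF_SIZE] = {10, 20, 30, 40, 50, 60, 70, 80};

// Looks up coef[] for each of the four indices in idx and adds the result to acc.
static __m128i add_coef(__m128i idx, __m128i acc) {
    // One aligned 16-byte store replaces the four movd/punpckhdq transfers.
    alignas(16) int32_t d[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(d), idx);

    for (int i = 0; i < 4; ++i) {
        assert(d[i] >= 0 && d[i] < COEF_SIZE);  // indices must stay inside the table
    }

    // Scalar loads of the indices, then rebuild a vector from the table entries.
    // _mm_set_epi32 takes its arguments from the highest lane down to the lowest.
    __m128i looked_up = _mm_set_epi32(coef[d[3]], coef[d[2]], coef[d[1]], coef[d[0]]);

    return _mm_add_epi32(looked_up, acc);  // same role as the paddd in the asm
}
```

The scalar loads from d[] read data that was just written by a wider store, so how well store forwarding copes with it depends on the CPU.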
Posted on 2004-06-05 06:43:24 by Scali

1) Use the ALU instead of MMX/SSE.
2) Test whether going through memory is faster than eax on AMD.
3) Buy an Intel CPU and ignore broken clones.


1: It's much slower with the ALU, since we calculate quite a bit before we get the final d for the memory access.

2: Tried it, but it wasn't faster, probably because of memory stalls. What would the asm code look like with memory access and correct scheduling? I mean, should I write all 2/4 "d" values back in one go through a __m128i*/__m64*, or write them back int by int? We then read those values again to fill new MMX/XMM registers, so how can the CPU make the best use of out-of-order execution and store forwarding?

3: AMD64 has a latency of 9 (VectorPath) and the P4 has a latency of 10 using XMM registers. Only for movd with MMX registers is the P4 faster, but the P4 needs 6 cycles for a simple movq mmx,mmx ....
Posted on 2004-06-05 07:05:28 by Andy2222
1/2: I don't have enough information to comment on that. I have no idea what calculations you do. Frankly I don't care either. You should solve your own problems, I just point out some areas for you to explore.

3: You are overlooking the obvious: P4 has a LOT more cycles than AMD64 per second. It can afford to take more cycles per instruction and still beat the crap out of AMD64.
Posted on 2004-06-05 07:33:45 by Scali