Hi all,
I'm calculating some indexes for an array using SIMD (MMX/SSE2). The C++ code looks like this:
int d;
d = (a - b)<<16;
d = (c + g)>>12;
.
.
.
return h + coef;
The asm code calculates "d" using SIMD, so I end up with 4, 8 or 16 values of "d" in MMX or XMM registers. (coef holds values which are precalculated for "d".)
It's inline asm:
_asm{
.
.
.
// calculating "d"
movd eax,mm2 //fixme AMD penalty 5-10 cycles
movd ecx,mm3 //fixme AMD penalty 5-10 cycles
punpckhdq mm2,mm2 // high dword into low dword
punpckhdq mm3,mm3 // high dword into low dword
movd ebx,mm2 //fixme AMD penalty 5-10 cycles
movd edx,mm3 //fixme AMD penalty 5-10 cycles
movd mm6,
movd mm5,
punpckldq mm6,
punpckldq mm5,
// ---> moving memory values based on "d" into MMX regs
paddd mm6,mm0
paddd mm5,mm1
.
.
.
}
The problem is that you get a huge penalty using "movd ebx,mm2" on AMD CPUs, and even if I work with a temporary __m64 address the code is slow.
Does anyone know a better way to calculate an index, then access those memory addresses and get the values back into an MMX register? Maybe using an array of pointers or something like that?
Or does anyone know a way to load/store the values without huge penalties and load/store forwarding stalls?
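To make the pattern concrete, here is a rough sketch of the same idea with SSE2 intrinsics instead of inline asm: the four "d" values are stored to a small aligned buffer with one 16-byte store, read back as scalars for the table lookups, and the results are packed into an XMM register again. The names coef_table and lookup4 and the table size are made up for illustration, and this is only the memory round trip written out in C++; whether it is any faster than movd on AMD would have to be measured.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Hypothetical lookup table standing in for "coef"; size and contents are assumptions.
static int32_t coef_table[256];

// Takes four 32-bit indices ("d" values) in an XMM register, looks each one
// up in coef_table, and returns the four table values packed in an XMM register.
// The caller must guarantee that every index is within the table bounds.
static inline __m128i lookup4(__m128i idx)
{
    alignas(16) int32_t i[4];
    _mm_store_si128(reinterpret_cast<__m128i*>(i), idx);  // one aligned 16-byte store

    // Four scalar index loads followed by four table loads.
    // _mm_set_epi32 takes its arguments from the highest lane down to the lowest.
    return _mm_set_epi32(coef_table[i[3]], coef_table[i[2]],
                         coef_table[i[1]], coef_table[i[0]]);
}
```

The result could then be added to the other operand with _mm_add_epi32, which corresponds to the paddd step in the asm above.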