Hi all,

I'm looking for a replacement for the packusdw MMX instruction, because it simply doesn't exist... Obviously it's easy to find an equivalent code sequence, but I'm really searching for the fastest way possible.

Ok, the problem is that I have 4 floating-point values in an SSE register, and I want to store them as 0.16 fixed-point unsigned integers in an MMX register. If the packusdw instruction existed, this would be as simple (and fast) as:

mulps xmm0, _65536
cvtps2pi mm0, xmm0
movhlps xmm0, xmm0
cvtps2pi mm1, xmm0
packusdw mm0, mm1

Floating-point numbers outside the [0, 1] range should wrap around, not be saturated, as expected for a packusdw instruction. So it merely has to select the low 16 bits of each doubleword and pack them together. The best equivalent I've found so far is:

pshufw mm0, mm0, 0x08
pshufw mm1, mm1, 0x08
punpckldq mm0, mm1
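
For anyone who wants to check the shuffle constants, here is a rough scalar C model of the three instructions above. It only models the integer side (the dwords that cvtps2pi would produce), with an MMX register represented as a uint64_t. The helper names (model_pshufw, model_punpckldq, make_mm, pack_low_words) are made up for this sketch and are not real intrinsics.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of pshufw: result word i is source word ((imm8 >> 2*i) & 3). */
static uint64_t model_pshufw(uint64_t src, unsigned imm8)
{
    assert(imm8 <= 0xFFu);            /* the immediate is 8 bits wide */
    uint64_t dst = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;
        dst |= word << (16 * i);
    }
    return dst;
}

/* Scalar model of punpckldq: low dword of a, followed by low dword of b. */
static uint64_t model_punpckldq(uint64_t a, uint64_t b)
{
    return (a & 0xFFFFFFFFu) | (b << 32);
}

/* Hypothetical helper: build a 64-bit "MMX register" from two dwords. */
static uint64_t make_mm(int32_t lo, int32_t hi)
{
    return (uint64_t)(uint32_t)lo | ((uint64_t)(uint32_t)hi << 32);
}

/* The pshufw / pshufw / punpckldq sequence: keeps the low word of each of
   the four dwords in mm0:mm1, i.e. wraps instead of saturating. */
static uint64_t pack_low_words(uint64_t mm0, uint64_t mm1)
{
    mm0 = model_pshufw(mm0, 0x08);    /* words 0 and 2 -> low dword */
    mm1 = model_pshufw(mm1, 0x08);
    return model_punpckldq(mm0, mm1);
}
```

With 0x08 as the immediate, the low dword of each shuffled register holds source words 0 and 2, which are the low halves of the two dwords. For example, the dwords 81920, -16384, 32768 and 65536 (1.25, -0.25, 0.5 and 1.0 scaled by 65536) come out as the words 0x4000, 0xC000, 0x8000 and 0x0000.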

Unfortunately this is the greatest bottleneck in my code. The multiplication by 65536 is actually done early on, so the lack of packusdw makes my code 50% longer. :shock: So I really need an alternative way of doing this.

If you've got any ideas, anything at all, please let me know! Thanks.
Posted on 2004-10-20 08:17:05 by C0D1F1ED
Anyone? This is for texture mapping.

The complete calculation is that I have four u-coordinates in one SSE register, and four v-coordinates in another. They have to be converted to integer and combined together to get the offset in the texture map. Then sixteen texels are read, multiplied by the integer fractions for bilinear filtering, and combined to form four samples. My 'best' result so far is 90 clock cycles -per sample- on a highly efficient Pentium M. :cry: What bothers me is that there are only 180 instructions (for four samples), so the processor executes one instruction every two clock cycles. I hoped it was the other way around...
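
To make the steps concrete, here is a plain scalar C sketch of one sample, useful as a reference to compare the MMX/SSE version against. Everything about the layout is an assumption for illustration only: a square power-of-two texture with single-channel 8-bit texels, wrapping addressing, no half-texel offset, and fractions reduced to 8 bits. The Texture struct and sample_bilinear are hypothetical names, not the actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: square texture, side = 1 << log2_size, 8-bit texels. */
typedef struct {
    const uint8_t *texels;
    unsigned log2_size;
} Texture;

/* u and v are 0.16 fixed-point coordinates that wrap around. */
static uint8_t sample_bilinear(const Texture *t, uint16_t u, uint16_t v)
{
    unsigned bits = t->log2_size;
    assert(bits <= 16);                   /* keeps u << bits within 32 bits */
    unsigned mask = (1u << bits) - 1u;

    /* Scale to texel space: high 16 bits = texel index, low 16 = fraction. */
    uint32_t us = (uint32_t)u << bits;
    uint32_t vs = (uint32_t)v << bits;
    unsigned x0 = us >> 16, y0 = vs >> 16;
    unsigned x1 = (x0 + 1) & mask;        /* neighbours wrap at the edge */
    unsigned y1 = (y0 + 1) & mask;
    uint32_t fx = (us & 0xFFFFu) >> 8;    /* 8-bit fractions, 0..255 */
    uint32_t fy = (vs & 0xFFFFu) >> 8;

    /* Combine the integer coordinates into offsets and read four texels. */
    uint32_t t00 = t->texels[(y0 << bits) + x0];
    uint32_t t10 = t->texels[(y0 << bits) + x1];
    uint32_t t01 = t->texels[(y1 << bits) + x0];
    uint32_t t11 = t->texels[(y1 << bits) + x1];

    /* Weight by the fractions: horizontally first, then vertically. */
    uint32_t top    = t00 * (256 - fx) + t10 * fx;
    uint32_t bottom = t01 * (256 - fx) + t11 * fx;
    return (uint8_t)((top * (256 - fy) + bottom * fy) >> 16);
}
```

Four of these make up the sixteen texel reads mentioned above. On a 2x2 texture holding 10, 20, 30, 40, sampling at u = v = 0.25 (0x4000) lands halfway between all four texels and gives 25.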

I'm aiming for 75 clock cycles, so if anyone has any ideas for optimizing these steps, please let me know!
Posted on 2004-10-21 04:25:08 by C0D1F1ED