Hi all,

I'm looking for a replacement for the packusdw MMX instruction, because it simply doesn't exist... Obviously it's easy to find an equivalent code sequence, but I'm really searching for the fastest way possible.

Ok, the problem is that I have 4 floating-point values in an SSE register, and I want to store them as 0.16 fixed-point unsigned integers in an MMX register. If the packusdw instruction existed, this would be as simple (and fast) as:

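```
mulps xmm0, _65536      ; scale the four floats by 65536 (0.16 fixed-point)
cvtps2pi mm0, xmm0      ; convert the two low floats to 32-bit integers
movhlps xmm0, xmm0      ; move the two high floats into the low half
cvtps2pi mm1, xmm0      ; convert those two as well
packusdw mm0, mm1       ; (nonexistent) pack the four doublewords into four words
```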
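To pin down what each lane of that sequence is meant to produce, here is a small scalar C model. It is only a sketch: the name `to_fixed_0_16` is made up for illustration, and it rounds half away from zero, whereas cvtps2pi follows the MXCSR rounding mode (round-to-nearest-even by default), so exact ties can differ by one.

```c
#include <stdint.h>

/* Hypothetical helper: scalar model of one lane of the conversion.
   Assumes x * 65536 fits in a signed 32-bit integer. */
static uint16_t to_fixed_0_16(float x)
{
    float v = x * 65536.0f;                       /* the mulps step */
    /* Round to nearest, ties away from zero (an approximation of the
       default MXCSR behaviour, which rounds ties to even). */
    int32_t i = (int32_t)(v + (v >= 0.0f ? 0.5f : -0.5f));
    /* Keep only the low 16 bits, so values outside [0, 1) wrap around. */
    return (uint16_t)((uint32_t)i & 0xFFFFu);
}
```

With this model 0.25 and 1.25 both map to 0x4000, and -0.25 maps to 0xC000, which is the wrap-around behaviour I'm after.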

Floating-point numbers outside the [0, 1] range should wrap around, not be saturated, as expected for a packusdw instruction. So it merely has to select the lowest 16 bits of each doubleword and pack them together. The best equivalent I found so far is:

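```
pshufw mm0, mm0, 0x08   ; move words 0 and 2 (low halves of both doublewords) into the low doubleword
pshufw mm1, mm1, 0x08   ; same for the second pair
punpckldq mm0, mm1      ; combine the two low doublewords into mm0
```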
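To double-check that this shuffle sequence really selects the low word of each doubleword, the three instructions can be modelled in plain C on 64-bit integers. This is a sketch for verification only; the function names `model_pshufw`, `model_punpckldq` and `pack_low_words` are hypothetical and an MMX register is represented as a `uint64_t`.

```c
#include <stdint.h>

/* Model of pshufw: each 2-bit field of imm8 selects the source word
   that is written to the corresponding destination word. */
static uint64_t model_pshufw(uint64_t src, unsigned imm8)
{
    uint64_t dst = 0;
    for (int i = 0; i < 4; i++) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;
        uint64_t word = (src >> (16 * sel)) & 0xFFFFu;
        dst |= word << (16 * i);
    }
    return dst;
}

/* Model of punpckldq: the low doubleword of a stays in the low half,
   the low doubleword of b goes to the high half. */
static uint64_t model_punpckldq(uint64_t a, uint64_t b)
{
    return (a & 0xFFFFFFFFu) | (b << 32);
}

/* The three-instruction replacement for the missing pack. */
static uint64_t pack_low_words(uint64_t mm0, uint64_t mm1)
{
    mm0 = model_pshufw(mm0, 0x08);
    mm1 = model_pshufw(mm1, 0x08);
    return model_punpckldq(mm0, mm1);
}
```

Feeding it the doublewords 0x00011111, 0x00022222 in mm0 and 0x00033333, 0x00044444 in mm1 yields the words 0x1111, 0x2222, 0x3333, 0x4444, with the upper halves discarded and no saturation.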

Unfortunately this is the biggest bottleneck in my code. The multiplication by 65536 is actually done early on, so the lack of packusdw makes my code 50% longer. :shock: So I really need an alternative way of doing this.

If you have any ideas, anything at all, please let me know! Thanks.

Anyone? This is for texture mapping.

The complete calculation is that I have four u-coordinates in one SSE register, and four v-coordinates in another. They have to be converted to integer and combined together to get the offset in the texture map. Then sixteen texels are read, multiplied by the integer fractions for bilinear filtering, and combined to form four samples. My 'best' result so far is 90 clock cycles -per sample- on a highly efficient Pentium M. :cry: What bothers me is that there are only 180 instructions (for four samples), so the processor executes one instruction every two clock cycles. I hoped it was the other way around...
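For anyone who wants the steps spelled out, here is a scalar C sketch of one sample. It is not my actual code, just the same idea under simplifying assumptions that are all mine for this illustration: a square power-of-two texture, one byte per texel, row-major layout, wrap-around addressing and 8-bit filter fractions. The name `sample_bilinear` and its parameters are hypothetical.

```c
#include <stdint.h>

/* Hypothetical scalar reference for one bilinear sample.
   tex:       square texture, (1 << log2_size) texels per side, 1 byte each
   u, v:      0.16 fixed-point coordinates (already wrapped to 16 bits)
   Assumes log2_size <= 15. */
static uint8_t sample_bilinear(const uint8_t *tex, unsigned log2_size,
                               uint16_t u, uint16_t v)
{
    uint32_t mask = (1u << log2_size) - 1u;

    /* Scale to texel space, keeping 16 fractional bits. */
    uint32_t uf = (uint32_t)u << log2_size;
    uint32_t vf = (uint32_t)v << log2_size;

    /* Integer texel coordinates; the neighbours wrap at the edge. */
    uint32_t x0 = uf >> 16, y0 = vf >> 16;
    uint32_t x1 = (x0 + 1u) & mask, y1 = (y0 + 1u) & mask;

    /* 8-bit integer fractions used as filter weights. */
    uint32_t fx = (uf >> 8) & 0xFFu, fy = (vf >> 8) & 0xFFu;

    /* Four texels per sample (sixteen for four samples). */
    uint32_t t00 = tex[(y0 << log2_size) + x0];
    uint32_t t10 = tex[(y0 << log2_size) + x1];
    uint32_t t01 = tex[(y1 << log2_size) + x0];
    uint32_t t11 = tex[(y1 << log2_size) + x1];

    /* Blend horizontally, then vertically; weights sum to 256 * 256. */
    uint32_t top = t00 * (256u - fx) + t10 * fx;
    uint32_t bot = t01 * (256u - fx) + t11 * fx;
    return (uint8_t)((top * (256u - fy) + bot * fy) >> 16);
}
```

The SSE/MMX version does this for four samples at once, which is where the pack of the four converted coordinates comes in.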

I'm aiming for 75 clock cycles, so if anyone has any ideas for optimizing these steps, please let me know!
