I have 16x 32-bit unsigned integers in xmm4-xmm7.

Now I need to cast them to 16x 16-bit unsigned shorts.

I tried subtracting (0xFFFF0000) from all doublewords, which worked, but if I use packssdw the values get clamped to 7FFF since it is signed... there is no unsigned version :(

So does anyone have a clue how to handle this fast?

I came up with 4x pshufhw and pshuflw

[00004444|00003333|00002222|00001111]

[00008888|00007777|00006666|00005555]

to

[00000000|44443333|22221111|22221111]

[00000000|88887777|66665555|66665555]

and a quadword shift right to

[00000000|44443333|22221111|22221111]

[00000000|00000000|44443333|22221111]

.

.

.

and then combine

[00000000|00000000|44443333|22221111]

[00000000|00000000|88887777|66665555]

to

[88887777|66665555|44443333|22221111]

which results in 8x shuffles, 4x shifts and 2x combines.

Kind of a lot of cycles for a simple cast...

Has anyone got another idea?
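
To make the saturation problem concrete, here is a minimal scalar sketch in Python of what packssdw does to a single doubleword lane. The helper name `packssdw_lane` is made up for illustration; it only models the signed clamp, not the real register layout.

```python
def packssdw_lane(dword):
    """Model one lane of packssdw: read the 32-bit value as signed,
    clamp it to [-8000h, 7FFFh] and return the resulting 16-bit word."""
    dword &= 0xFFFFFFFF
    signed = dword - 0x100000000 if dword & 0x80000000 else dword
    clamped = max(-0x8000, min(0x7FFF, signed))
    return clamped & 0xFFFF

# Unsigned words of 8000h and up do not fit the signed range and get clamped.
for value in (0x1234, 0x7FFF, 0x8000, 0xFFFF):
    print(f"{value:08X} -> {packssdw_lane(value):04X}")
```

In this model every input from 8000h to FFFFh comes out as 7FFFh, which is the behaviour described in the question.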

What about using pand to remove the high word of each dword, and then doing the packssdw? It should not saturate then, since there is nothing to saturate.

Or if you need the saturation, only remove the topmost bit. Then everything is considered positive, and saturation only goes upwards.

Rather than eschewing packssdw altogether you could use it but deal with the high bits separately:

```
.data
align 16
owMask dd 4 dup(00007FFFh)

.code
; save high bits
movdqa xmm0,xmm4
psrlw xmm0,15
movdqa xmm1,xmm5
psrlw xmm1,15
; remove high bits
pand xmm4,owMask
pand xmm5,owMask
; do the pack
packssdw xmm5,xmm4
; pack the high bits
packssdw xmm1,xmm0
; put into place
psllw xmm1,15
; combine
por xmm5,xmm1
```

So if xmm4 were 00001111000022220000333300004444

and xmm5 were 00005555000066660000777700008888

then you end up with xmm5 as 11112222333344445555666677778888
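
The routine above can be checked with a small scalar model in Python. This is only a sketch: registers are modelled as lists of four dwords with the lowest lane first, the function names are invented for the example, and it assumes the high word of every dword is zero on entry.

```python
def packssdw(dst, src):
    """Model packssdw dst,src on lists of four dwords (lowest lane first).
    Returns eight words: dst's lanes in the low half, src's in the high half."""
    def clamp(d):
        s = d - 0x100000000 if d & 0x80000000 else d
        return max(-0x8000, min(0x7FFF, s)) & 0xFFFF
    return [clamp(d) for d in dst + src]

def pack_with_separate_high_bit(xmm4, xmm5):
    # save high bits: psrlw by 15 leaves bit 15 of each low word as 0 or 1
    # (assumption: the high word of every dword is zero)
    hi4 = [(d & 0xFFFF) >> 15 for d in xmm4]
    hi5 = [(d & 0xFFFF) >> 15 for d in xmm5]
    # remove high bits: pand with 00007FFFh
    lo4 = [d & 0x7FFF for d in xmm4]
    lo5 = [d & 0x7FFF for d in xmm5]
    packed = packssdw(lo5, lo4)   # packssdw xmm5,xmm4
    bits = packssdw(hi5, hi4)     # packssdw xmm1,xmm0
    # psllw by 15 moves each saved bit back to bit 15, por merges it in
    return [p | ((b << 15) & 0xFFFF) for p, b in zip(packed, bits)]

xmm4 = [0x4444, 0x3333, 0x2222, 0x1111]   # 00001111000022220000333300004444
xmm5 = [0x8888, 0x7777, 0x6666, 0x5555]   # 00005555000066660000777700008888
result = pack_with_separate_high_bit(xmm4, xmm5)
print("".join(f"{w:04X}" for w in reversed(result)))
```

Printed with the highest word first, the model gives 11112222333344445555666677778888 for the example registers, matching the result stated above.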

That wouldn't work. A value of 8000h and up would still saturate.

Maybe this could work? (I've never done any programming with these modern instructions :P)

```
; (at beginning)
mov eax,8000h
movd xmm0,eax
pshuflw xmm1,xmm0,0
pshufd xmm0,xmm0,0
movlhps xmm1,xmm1
...
psubd xmm4,xmm0
psubd xmm5,xmm0
packssdw xmm4,xmm5
psubd xmm6,xmm0
psubd xmm7,xmm0
packssdw xmm6,xmm7
pxor xmm4,xmm1
pxor xmm6,xmm1
```
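
As a rough check of the subtract, pack, xor idea, here is a per-lane sketch in Python. The helper names are hypothetical and the model looks at one dword at a time rather than at whole registers.

```python
def packssdw_lane(dword):
    """Signed clamp of one 32-bit lane to a 16-bit word, as packssdw does."""
    signed = dword - 0x100000000 if dword & 0x80000000 else dword
    return max(-0x8000, min(0x7FFF, signed)) & 0xFFFF

def pack_unsigned_biased(dword):
    biased = (dword - 0x8000) & 0xFFFFFFFF   # psubd with 00008000h
    word = packssdw_lane(biased)             # packssdw
    return word ^ 0x8000                     # pxor with 8000h

# Every value whose high word is zero survives the round trip unchanged.
assert all(pack_unsigned_biased(v) == v for v in range(0x10000))
print(f"{pack_unsigned_biased(0x8000):04X} {pack_unsigned_biased(0xFFFF):04X}")
```

In this model, dwords somewhat above FFFFh (for example 00010000h) come out as FFFFh, while a dword such as FFFFFFFFh comes out as 0000h, because it looks negative to the signed pack.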

That wouldn't work. A value of 8000h and up would still saturate.

Then I suppose you need to sign-extend the word in order to make it not saturate, instead of just tossing out the high bits. It's late. And I don't really care :P

You can do this by simply using a shift left (pslld) by 16, and then an arithmetic shift right by 16 (psrad).

That is probably the most elegant solution.

Something like:

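```
pslld xmm4, 16        ; move the low word up, discarding the old high word
psrad xmm4, 16        ; arithmetic shift back: sign-extends the low word
packssdw xmm4, xmm4   ; every dword now fits a signed word, so nothing saturates
```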

And please, comment your code for a change?
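
The shift-left, arithmetic-shift-right idea can be sketched per lane in Python as follows. The function names are made up for the example, and the model relies on Python's `>>` being an arithmetic shift for negative integers.

```python
def packssdw_lane(dword):
    """Signed clamp of one 32-bit lane to a 16-bit word, as packssdw does."""
    signed = dword - 0x100000000 if dword & 0x80000000 else dword
    return max(-0x8000, min(0x7FFF, signed)) & 0xFFFF

def truncate_lane(dword):
    shifted = (dword << 16) & 0xFFFFFFFF                  # pslld 16
    signed = shifted - 0x100000000 if shifted & 0x80000000 else shifted
    extended = (signed >> 16) & 0xFFFFFFFF                # psrad 16 (sign-extends)
    return packssdw_lane(extended)                        # packssdw, no clamping left to do

for value in (0x0000FFFF, 0x12348000, 0xABCD1234):
    print(f"{value:08X} -> {truncate_lane(value):04X}")
```

Whatever is in the high word is thrown away here, so this is a plain truncation to the low 16 bits rather than a saturating conversion.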

A value that is between 8000h and 0ffffh is outside the signed range, and won't fit. Andy2222 wants to pack unsigned words, however.

I am subtracting 8000h from every doubleword, so it will be between 0ffff8000h and 7fffh, then I pack and xor all the words with 8000h to get them back to what they were.

It seemed that the high words would be zero on entry. If that's not the case, then your last solution might be the best.

The version that handles the high bits separately works, but you end up with [44443333222211115555666677778888]

@Sephiroth3 I will test your solution too, and yes, the main problem is packssdw with words of 8000h and higher.

PS: Meanwhile I came up with this solution myself:

pshuflw xmm4,xmm4,136

pshuflw xmm5,xmm5,136

pshuflw xmm6,xmm6,136

pshuflw xmm7,xmm7,136

pshufhw xmm4,xmm4,136

pshufhw xmm5,xmm5,136

pshufhw xmm6,xmm6,136

pshufhw xmm7,xmm7,136

psrldq xmm4, 4

psrldq xmm5, 4

psrldq xmm6, 4

psrldq xmm7, 4

punpcklqdq xmm4, xmm5

punpcklqdq xmm6, xmm7

The problem is the packssdw instruction, which takes 4 cycles and is vectorized, so maybe using only pshuf and shifts will result in a faster conversion. I have to test this.
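
The shuffle-only sequence above can be modelled in Python to see where each word ends up. This is a sketch with registers as lists of eight words, lowest word first; the helper functions mimic the instructions of the same name only as far as this example needs, and `shuffle_pack` is an invented name. The immediate 136 is 10001000b, which selects words 0 and 2 of each quadword.

```python
def pshuflw(reg, imm):
    """Shuffle the low four words by the 2-bit fields of imm; high half unchanged."""
    return [reg[(imm >> (2 * i)) & 3] for i in range(4)] + reg[4:]

def pshufhw(reg, imm):
    """Shuffle the high four words by the 2-bit fields of imm; low half unchanged."""
    return reg[:4] + [reg[4 + ((imm >> (2 * i)) & 3)] for i in range(4)]

def psrldq(reg, nbytes):
    """Byte shift right of the whole register (only even byte counts modelled)."""
    nwords = nbytes // 2
    return reg[nwords:] + [0] * nwords

def punpcklqdq(dst, src):
    """Low quadword of dst followed by low quadword of src."""
    return dst[:4] + src[:4]

def dwords_to_words(dwords):
    words = []
    for d in dwords:
        words += [d & 0xFFFF, (d >> 16) & 0xFFFF]
    return words

def shuffle_pack(a, b):
    regs = []
    for dwords in (a, b):
        r = dwords_to_words(dwords)
        r = pshuflw(r, 136)
        r = pshufhw(r, 136)
        r = psrldq(r, 4)
        regs.append(r)
    return punpcklqdq(regs[0], regs[1])

packed = shuffle_pack([0x1111, 0x2222, 0x3333, 0x4444],
                      [0x5555, 0x6666, 0x7777, 0x8888])
print("".join(f"{w:04X}" for w in reversed(packed)))
```

Since no pack instruction is involved, this variant simply keeps the low word of every dword, so words of 8000h and up pass through unchanged.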

Are you sure you didn't forget about endianness? I.e.

```
owLine1 dw 4444h,0,3333h,0,2222h,0,1111h,0
movdqa xmm0,owLine1
; xmm0 is now 00001111000022220000333300004444
```

Yup, my fault :)

It seemed that the high words would be zero on entry. If that's not the case, then your last solution might be the best.

That depends on whether the high word is to be considered garbage or whether the value is actually larger than a single word and requires saturation.

If no saturation, my routine, if yes saturation, your routine. But you could implement it much smaller if you just stored the entire 8000h-word in memory. Would lose the push/pop/shuf etc.

```
b8 00 80 00 00  mov eax,8000h
66 0f 6e c0     movd xmm0,eax
f2 0f 70 c8 00  pshuflw xmm1,xmm0,0
66 0f 70 c0 00  pshufd xmm0,xmm0,0
0f 16 c9        movlhps xmm1,xmm1
```

I just realized I could use MOVD, and saved two more bytes...

It takes 14 bytes to set up XMM0, and 8 bytes to set up XMM1. Loading both from memory would use 24 bytes for each. Using the memory directly in the other operations would require 32 bytes for XMM0 and 24 bytes for XMM1.
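
The byte counts can be tallied from the listing with a few lines of Python. Which instructions count toward XMM0 and which toward XMM1 is my reading of the listing, and the 8-byte length of a movdqa load from an absolute 32-bit address is an assumption, not something stated in the post.

```python
# Encodings copied from the listing above.
listing = {
    "mov eax,8000h":       "b8 00 80 00 00",
    "movd xmm0,eax":       "66 0f 6e c0",
    "pshuflw xmm1,xmm0,0": "f2 0f 70 c8 00",
    "pshufd xmm0,xmm0,0":  "66 0f 70 c0 00",
    "movlhps xmm1,xmm1":   "0f 16 c9",
}

def size(*instructions):
    """Total encoded length in bytes of the named instructions."""
    return sum(len(bytes.fromhex(listing[name])) for name in instructions)

xmm0_setup = size("mov eax,8000h", "movd xmm0,eax", "pshufd xmm0,xmm0,0")
xmm1_setup = size("pshuflw xmm1,xmm0,0", "movlhps xmm1,xmm1")

# Assumption: 8 bytes for the movdqa load plus 16 bytes for the constant itself.
memory_load = 8 + 16

print(xmm0_setup, xmm1_setup, memory_load)
```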

If saturation is desired, then it won't be so simple actually, since the 32-bit integers were unsigned. The following code has to be used. Unfortunately, it is quite long:

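```
; set up the constants (at beginning)
mov eax,8000h
movd xmm0,eax
pshuflw xmm1,xmm0,0     ; low four words of xmm1 = 8000h
pshufd xmm0,xmm0,0      ; xmm0 = 4 x 00008000h
movlhps xmm1,xmm1       ; xmm1 = 8 x 8000h
...
; first eight values (xmm4, xmm5)
movdqa xmm2,xmm4
movdqa xmm3,xmm5
psubd xmm4,xmm0         ; bias into the signed range
psrad xmm2,31           ; all ones where bit 31 was set
psubd xmm5,xmm0
psrad xmm3,31
packssdw xmm4,xmm5      ; signed pack of the biased values
packssdw xmm2,xmm3      ; pack the bit-31 masks (0 or FFFFh)
pxor xmm4,xmm1          ; undo the bias
pxor xmm5,xmm1
por xmm4,xmm2           ; force FFFFh where bit 31 was set
por xmm5,xmm3
; second eight values (xmm6, xmm7)
movdqa xmm2,xmm6
movdqa xmm3,xmm7
psubd xmm6,xmm0
psrad xmm2,31
psubd xmm7,xmm0
psrad xmm3,31
packssdw xmm6,xmm7
packssdw xmm2,xmm3
pxor xmm6,xmm1
pxor xmm7,xmm1
por xmm6,xmm2
por xmm7,xmm3
```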

Maybe there's a better way?
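
For what it is worth, a per-lane Python sketch of the saturating version above looks like this. The names are hypothetical, and it models a single dword rather than the full register shuffle.

```python
def packssdw_lane(dword):
    """Signed clamp of one 32-bit lane to a 16-bit word, as packssdw does."""
    signed = dword - 0x100000000 if dword & 0x80000000 else dword
    return max(-0x8000, min(0x7FFF, signed)) & 0xFFFF

def pack_unsigned_saturate(dword):
    sign_mask = 0xFFFFFFFF if dword & 0x80000000 else 0   # psrad 31
    biased = (dword - 0x8000) & 0xFFFFFFFF                # psubd with 00008000h
    word = packssdw_lane(biased) ^ 0x8000                 # packssdw, then pxor with 8000h
    mask_word = packssdw_lane(sign_mask)                  # packed mask: 0 or FFFFh
    return word | mask_word                               # por

for value in (0x00001234, 0x0000FFFF, 0x00010000, 0x80000000, 0xFFFFFFFF):
    print(f"{value:08X} -> {pack_unsigned_saturate(value):04X}")
```

In this model every input behaves like min(value, FFFFh), including dwords with bit 31 set, which is the case the extra psrad/por steps are there for.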

I'm looking for the fastest way, by the way, since I have to do this in a video loop: (xres*yres + xres/2*yres/2 + xres/2*yres/2) * 25 frames per second.

For an 800x600 video we get 18000000 pixels to work with per second,

which means every cycle counts; memory size is irrelevant :)

It seems that an implementation without packssdw is faster, since the vectorized instruction blocks the following instructions (at least on an AMD).
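
The pixel-rate figure above can be reproduced with a quick calculation. The sketch assumes one full-resolution plane plus two planes at half resolution in each direction, which is how I read the formula; the names `luma` and `chroma` are my own labels for those planes.

```python
def samples_per_second(width, height, fps):
    """One full-size plane plus two quarter-size planes per frame, times the frame rate."""
    luma = width * height
    chroma = (width // 2) * (height // 2)
    return (luma + 2 * chroma) * fps

print(samples_per_second(800, 600, 25))   # 18000000
```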

You could modify my version with saturation I suppose.

Use pcmpgtd (or whatever it was) to test if the values are greater than FFFFh. The result of that pcmp is 0 for each dword that is <= FFFF and FFFFFFFFh for each dword that is > FFFF.

Now you OR those two together; effectively the values are now saturated. Then do the shifts and pack as before.
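
A per-lane Python sketch of this compare, OR, shift and pack idea is below. The helper names are invented. One caveat the model makes visible: pcmpgtd is a signed compare, so a dword with bit 31 set does not count as greater than FFFFh and is merely truncated.

```python
def to_signed(dword):
    return dword - 0x100000000 if dword & 0x80000000 else dword

def packssdw_lane(dword):
    """Signed clamp of one 32-bit lane to a 16-bit word, as packssdw does."""
    return max(-0x8000, min(0x7FFF, to_signed(dword))) & 0xFFFF

def pcmpgtd_lane(a, b):
    """Signed greater-than compare of two dwords: all ones if a > b, else zero."""
    return 0xFFFFFFFF if to_signed(a) > to_signed(b) else 0

def saturate_then_truncate(dword):
    mask = pcmpgtd_lane(dword, 0x0000FFFF)                # FFFFFFFFh where value > FFFFh
    merged = dword | mask                                 # por: low word becomes FFFFh
    shifted = (merged << 16) & 0xFFFFFFFF                 # pslld 16
    extended = (to_signed(shifted) >> 16) & 0xFFFFFFFF    # psrad 16
    return packssdw_lane(extended)                        # packssdw

for value in (0x00001234, 0x0000FFFF, 0x00010000, 0x7FFFFFFF, 0x80000000):
    print(f"{value:08X} -> {saturate_then_truncate(value):04X}")
```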

Two's complement sux :)

(sorry :)
