I have 16x 32-bit unsigned integers in xmm4-xmm7.

Now I need to cast them to 16x 16-bit unsigned shorts.

I tried subtracting (0xFFFF0000) from all the doublewords, which worked, but if I use packssdw I get saturation to 7FFF since it's signed... there is no unsigned version :(

So does anyone have a clue how to handle this fast?

I came up with 4x pshufhw and pshuflw:

[00004444|00003333|00002222|00001111]
[00008888|00007777|00006666|00005555]
to
[00000000|44443333|22221111|22221111]
[00000000|88887777|66665555|66665555]
and a quadword shift right to
[00000000|44443333|22221111|22221111]
[00000000|00000000|44443333|22221111]
.
.
.
and then combine
[00000000|00000000|44443333|22221111]
[00000000|00000000|88887777|66665555]
to
[88887777|66665555|44443333|22221111]

which results in 8x shuffles, 4x shifts and 2x combines

kind of a lot of cycles for a simple cast...

Anyone got another idea?
Posted on 2004-06-15 14:18:49 by Andy2222
What about using pand to remove the highword of each dword, and then do the packssdw? It should not saturate then, since there is nothing to saturate.
Or if you need the saturation, only remove the topmost bit. Then everything is considered positive, and saturation only goes upwards.
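
For reference, a rough C sketch of the first suggestion using SSE2 intrinsics (pand is _mm_and_si128, packssdw is _mm_packs_epi32). The function name and the array interface are made up for illustration. Keep in mind that packssdw clamps at 7FFFh, so with this variant only inputs whose low word is below 8000h come through unchanged.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: mask off the high word of each dword, then pack
   with signed saturation (packssdw). */
void pack8_mask_only(const uint32_t in[8], uint16_t out[8])
{
    const __m128i mask = _mm_set1_epi32(0x0000FFFF);
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* elements 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));  /* elements 4..7 */

    lo = _mm_and_si128(lo, mask);   /* pand: drop the high word */
    hi = _mm_and_si128(hi, mask);

    /* packssdw: any dword above 7FFFh is clamped to 7FFFh */
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(lo, hi));
}
```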
Posted on 2004-06-15 16:47:57 by Scali
Rather than eschewing packssdw altogether you could use it but deal with the high bits separately:



.data

align 16
owMask dd 4 dup(00007FFFh)

.code

; save high bits
movdqa xmm0,xmm4
psrlw xmm0,15
movdqa xmm1,xmm5
psrlw xmm1,15

; remove high bits
pand xmm4,owMask
pand xmm5,owMask

; do the pack
packssdw xmm5,xmm4

; pack the high bits
packssdw xmm1,xmm0
; put into place
psllw xmm1,15

; combine
por xmm5,xmm1


So if xmm4 were 00001111000022220000333300004444
and xmm5 were 00005555000066660000777700008888

then you end up with xmm5 as 11112222333344445555666677778888
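
The same steps can be sketched in C with SSE2 intrinsics. This is only an illustration: the function name and array interface are invented, and it assumes the high word of every dword is already zero, as in the example values above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: pack 8 zero-extended dwords into 8 words by
   packing the low 15 bits and bit 15 separately, then merging them.
   Assumes every input is <= 0xFFFF. */
void pack8_split(const uint32_t in[8], uint16_t out[8])
{
    const __m128i mask = _mm_set1_epi32(0x00007FFF);
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* elements 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));  /* elements 4..7 */

    /* save high bits: psrlw leaves 0 or 1 in each dword */
    __m128i top_lo = _mm_srli_epi16(lo, 15);
    __m128i top_hi = _mm_srli_epi16(hi, 15);

    /* remove high bits so the signed pack cannot saturate */
    lo = _mm_and_si128(lo, mask);
    hi = _mm_and_si128(hi, mask);

    __m128i low15 = _mm_packs_epi32(lo, hi);          /* packssdw */
    __m128i top   = _mm_packs_epi32(top_lo, top_hi);  /* packssdw */
    top = _mm_slli_epi16(top, 15);                    /* put bit 15 back in place */

    _mm_storeu_si128((__m128i *)out, _mm_or_si128(low15, top));
}
```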
Posted on 2004-06-15 17:21:46 by stormix
That wouldn't work. A value of 8000h and up would still saturate.

Maybe this could work? (I've never done any programming with these modern instructions :P)
(at beginning)

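mov eax,8000h
movd xmm0,eax
pshuflw xmm1,xmm0,0 ; low 4 words of xmm1 = 8000h
pshufd xmm0,xmm0,0 ; xmm0 = 4 x 00008000h
movlhps xmm1,xmm1 ; xmm1 = 8 x 8000h
...
psubd xmm4,xmm0 ; bias each dword into the signed word range
psubd xmm5,xmm0
packssdw xmm4,xmm5 ; no saturation for inputs 0..0FFFFh
psubd xmm6,xmm0
psubd xmm7,xmm0
packssdw xmm6,xmm7
pxor xmm4,xmm1 ; flip bit 15 back to undo the bias
pxor xmm6,xmm1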
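
A C sketch of this bias trick with SSE2 intrinsics, for anyone who wants to check it quickly. The function name and array interface are made up. Inputs up to 0FFFFh come out unchanged; inputs from 10000h up to 80007FFFh clamp to 0FFFFh, and anything larger wraps, so this is only a full solution when the high words are known to be small.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: subtract 8000h so unsigned words land in the
   signed word range, pack with packssdw, then XOR 8000h to undo it. */
void pack8_bias(const uint32_t in[8], uint16_t out[8])
{
    const __m128i bias32 = _mm_set1_epi32(0x8000);   /* 4 x 00008000h */
    const __m128i bias16 = _mm_set1_epi16(-32768);   /* 8 x 8000h */
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* elements 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));  /* elements 4..7 */

    lo = _mm_sub_epi32(lo, bias32);          /* psubd */
    hi = _mm_sub_epi32(hi, bias32);
    __m128i p = _mm_packs_epi32(lo, hi);     /* packssdw */
    p = _mm_xor_si128(p, bias16);            /* pxor: flip bit 15 back */

    _mm_storeu_si128((__m128i *)out, p);
}
```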
Posted on 2004-06-15 17:27:35 by Sephiroth3
That wouldn't work. A value of 8000h and up would still saturate.


Then I suppose you need to sign-extend the word in order to make it not saturate, instead of just tossing out the high bits. It's late. And I don't really care :P
You can do this by simply using a shift left (pslld) by 16, and then an arithmetic shift right by 16 (psrad).
That is probably the most elegant solution.

Something like:



pslld xmm4, 16
psrad xmm4, 16
packssdw xmm4, xmm4
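
Roughly the same thing in C with SSE2 intrinsics, as a sketch only (invented function name and array interface, and it packs two input registers instead of one with itself). Sign-extending the low word first means packssdw never has anything to saturate, so the result is a plain truncation to the low 16 bits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: truncate 8 dwords to their low words by
   sign-extending the low word and then packing. */
void pack8_signext(const uint32_t in[8], uint16_t out[8])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* elements 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));  /* elements 4..7 */

    /* pslld 16 then psrad 16: the high word becomes a copy of bit 15 */
    lo = _mm_srai_epi32(_mm_slli_epi32(lo, 16), 16);
    hi = _mm_srai_epi32(_mm_slli_epi32(hi, 16), 16);

    /* packssdw: every dword now fits in a signed word */
    _mm_storeu_si128((__m128i *)out, _mm_packs_epi32(lo, hi));
}
```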


And please, comment your code for a change?
Posted on 2004-06-15 17:47:03 by Scali
A value that is between 8000h and 0ffffh is outside the signed range, and won't fit. Andy2222 wants to pack unsigned words, however.
I am subtracting 8000h from every doubleword, so it will be between 0ffff8000h and 7fffh, then I pack and xor all the words with 8000h to get them back to what they were.

It seemed that the high words would be zero on entry. If that's not the case, then your last solution might be the best.
Posted on 2004-06-15 18:00:51 by Sephiroth3

Rather than eschewing packssdw altogether you could use it but deal with the high bits separately:



.data

align 16
owMask dd 4 dup(00007FFFh)

.code

; save high bits
movdqa xmm0,xmm4
psrlw xmm0,15
movdqa xmm1,xmm5
psrlw xmm1,15

; remove high bits
pand xmm4,owMask
pand xmm5,owMask

; do the pack
packssdw xmm5,xmm4

; pack the high bits
packssdw xmm1,xmm0
; put into place
psllw xmm1,15

; combine
por xmm5,xmm1


So if xmm4 were 00001111000022220000333300004444
and xmm5 were 00005555000066660000777700008888

then you end up with xmm5 as 11112222333344445555666677778888


This works, but you end up with [44443333222211115555666677778888]

@Sephiroth3 I will test your solution too, and yes, the main problem is packssdw with words of 8000h and higher.

PS: I came up with this solution myself in the meantime:

pshuflw xmm4,xmm4,136
pshuflw xmm5,xmm5,136
pshuflw xmm6,xmm6,136
pshuflw xmm7,xmm7,136

pshufhw xmm4,xmm4,136
pshufhw xmm5,xmm5,136
pshufhw xmm6,xmm6,136
pshufhw xmm7,xmm7,136

psrldq xmm4, 4
psrldq xmm5, 4
psrldq xmm6, 4
psrldq xmm7, 4

punpcklqdq xmm4, xmm5
punpcklqdq xmm6, xmm7
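
A C sketch of the shuffle-only version with SSE2 intrinsics, handling one pair of registers (the function name and array interface are invented for illustration). The immediate 136 is 88h, which selects words 0 and 2 of each half, so the result is a plain truncation to the low word of every dword.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: gather the low word of each dword using
   pshuflw/pshufhw, a byte shift and punpcklqdq, with no pack at all. */
void pack8_shuffle(const uint32_t in[8], uint16_t out[8])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* elements 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));  /* elements 4..7 */

    /* words become [w0,w2,w0,w2 | w4,w6,w4,w6] */
    lo = _mm_shufflehi_epi16(_mm_shufflelo_epi16(lo, 0x88), 0x88);
    hi = _mm_shufflehi_epi16(_mm_shufflelo_epi16(hi, 0x88), 0x88);

    /* psrldq 4: low quadword is now [w0,w2,w4,w6] */
    lo = _mm_srli_si128(lo, 4);
    hi = _mm_srli_si128(hi, 4);

    /* punpcklqdq: join the two low quadwords */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi64(lo, hi));
}
```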

The problem is the packssdw instruction, which takes 4 cycles and is vectorized, so maybe using only pshuf and shifts will result in a faster conversion. I have to test this.
Posted on 2004-06-15 18:04:56 by Andy2222
Are you sure you didn't forget about endianness? I.e.



owLine1 dw 4444h,0,3333h,0,2222h,0,1111h,0

movdqa xmm0,owLine1

; xmm0 is now 00001111000022220000333300004444
Posted on 2004-06-15 18:10:46 by stormix
Yup, my fault :)
Posted on 2004-06-15 18:18:38 by Andy2222
It seemed that the high words would be zero on entry. If that's not the case, then your last solution might be the best.


That depends on whether the high word is to be considered garbage or that the value is actually larger than a single word and requires saturation.
If no saturation, my routine, if yes saturation, your routine. But you could implement it much smaller if you just stored the entire 8000h-word in memory. Would lose the push/pop/shuf etc.
Posted on 2004-06-16 02:14:17 by Scali
b8 00 80 00 00    mov eax,8000h
66 0f 6e c0       movd xmm0,eax
f2 0f 70 c8 00    pshuflw xmm1,xmm0,0
66 0f 70 c0 00    pshufd xmm0,xmm0,0
0f 16 c9          movlhps xmm1,xmm1

I just realized I could use MOVD, and saved two more bytes...
It takes 14 bytes to set up XMM0, and 8 bytes to set up XMM1. Loading both from memory would use 24 bytes for each. Using the memory directly in the other operations would require 32 bytes for XMM0 and 24 bytes for XMM1.

If saturation is desired, then it won't be so simple actually, since the 32-bit integers were unsigned. The following code has to be used. Unfortunately, it is quite long:


mov eax,8000h
movd xmm0,eax
pshuflw xmm1,xmm0,0
pshufd xmm0,xmm0,0 ; xmm0 = 4 x 00008000h
movlhps xmm1,xmm1 ; xmm1 = 8 x 8000h
...
movdqa xmm2,xmm4
movdqa xmm3,xmm5
psubd xmm4,xmm0 ; bias into the signed word range
psrad xmm2,31 ; FFFFFFFFh where bit 31 was set
psubd xmm5,xmm0
psrad xmm3,31
packssdw xmm4,xmm5 ; packed result lives in xmm4 only
packssdw xmm2,xmm3 ; pack the bit-31 masks the same way
pxor xmm4,xmm1 ; undo the bias
por xmm4,xmm2 ; force FFFFh where bit 31 was set
movdqa xmm2,xmm6
movdqa xmm3,xmm7
psubd xmm6,xmm0
psrad xmm2,31
psubd xmm7,xmm0
psrad xmm3,31
packssdw xmm6,xmm7 ; packed result lives in xmm6 only
packssdw xmm2,xmm3
pxor xmm6,xmm1
por xmm6,xmm2
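
A C sketch of the saturating variant with SSE2 intrinsics, for one pair of registers (the function name and array interface are made up). The bias trick alone already clamps inputs from 10000h to 80007FFFh; the extra psrad/por path catches inputs with bit 31 set, which would otherwise look negative to packssdw.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: convert 8 unsigned dwords to words, clamping
   anything above 0xFFFF to 0xFFFF. */
void pack8_saturate(const uint32_t in[8], uint16_t out[8])
{
    const __m128i bias32 = _mm_set1_epi32(0x8000);   /* 4 x 00008000h */
    const __m128i bias16 = _mm_set1_epi16(-32768);   /* 8 x 8000h */
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* elements 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));  /* elements 4..7 */

    /* psrad 31: all ones where bit 31 of the input is set */
    __m128i big_lo = _mm_srai_epi32(lo, 31);
    __m128i big_hi = _mm_srai_epi32(hi, 31);

    /* bias, pack with signed saturation, undo the bias */
    __m128i p = _mm_packs_epi32(_mm_sub_epi32(lo, bias32),
                                _mm_sub_epi32(hi, bias32));
    p = _mm_xor_si128(p, bias16);

    /* masks of -1 pack to FFFFh; OR them in to clamp the huge inputs */
    p = _mm_or_si128(p, _mm_packs_epi32(big_lo, big_hi));

    _mm_storeu_si128((__m128i *)out, p);
}
```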

Maybe there's a better way?
Posted on 2004-06-16 07:17:23 by Sephiroth3
I'm looking for the fastest way, by the way, since I have to do this in a video loop: (xres*yres + xres/2*yres/2 + xres/2*yres/2) * 25 frames per second.

For an 800x600 video we get 18000000 pixels to work with per second.

That means every cycle counts; memory size is irrelevant :)
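
The arithmetic behind that figure, as a small C sketch. The function name is made up, and reading the formula as one full-resolution plane plus two half-by-half planes per frame is an assumption based on the expression above.

```c
#include <stdint.h>

/* Hypothetical helper: values processed per second for one full plane
   plus two planes at half width and half height. */
uint64_t samples_per_second(uint32_t xres, uint32_t yres, uint32_t fps)
{
    uint64_t full = (uint64_t)xres * yres;              /* 800*600 = 480000 */
    uint64_t half = (uint64_t)(xres / 2) * (yres / 2);  /* 400*300 = 120000 */
    return (full + 2 * half) * fps;                     /* 720000 per frame */
}
```

With 800x600 at 25 frames per second this gives 720000 * 25 = 18000000.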

It seems that an implementation without packssdw is faster, since the vectorized instruction blocks the following instructions (at least on an AMD).
Posted on 2004-06-16 07:33:15 by Andy2222
You could modify my version with saturation I suppose.
Use pcmpgtd (or whatever it was called) to test if the values are greater than FFFFh. The result of that compare is 0 for each dword that is <= FFFFh and FFFFFFFFh for each dword that is > FFFFh.
Now you OR those two together; effectively the values are saturated now. Then do the shifts and pack as before.
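
A C sketch of this idea with SSE2 intrinsics (invented function name and array interface). One deviation from the description, labeled here as an assumption: pcmpgtd is a signed compare, so instead of comparing the full dword against FFFFh the sketch shifts the high word down first and compares that against zero, which also catches inputs with bit 31 set.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Saturate one register of unsigned dwords to 0xFFFF, leaving each
   dword sign-extended from its low word so packssdw cannot clamp. */
static __m128i clamp_and_signext(__m128i x)
{
    /* all ones where the high word is non-zero, i.e. x > 0xFFFF */
    __m128i over = _mm_cmpgt_epi32(_mm_srli_epi32(x, 16), _mm_setzero_si128());
    x = _mm_or_si128(x, over);                           /* por: force FFFFFFFFh */
    return _mm_srai_epi32(_mm_slli_epi32(x, 16), 16);    /* pslld 16, psrad 16 */
}

/* Hypothetical helper: compare-based saturating pack of 8 dwords. */
void pack8_saturate_cmp(const uint32_t in[8], uint16_t out[8])
{
    __m128i lo = _mm_loadu_si128((const __m128i *)in);        /* elements 0..3 */
    __m128i hi = _mm_loadu_si128((const __m128i *)(in + 4));  /* elements 4..7 */
    __m128i p  = _mm_packs_epi32(clamp_and_signext(lo), clamp_and_signext(hi));
    _mm_storeu_si128((__m128i *)out, p);
}
```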
Posted on 2004-06-16 07:55:45 by Scali
two-complement sux :)
(sorry :)
Posted on 2004-06-16 15:31:43 by HeLLoWorld