Hi all,

I have an SSE register with 4 x 32-bit numbers, and I'd like to shift them by four different values in another SSE register.

Unfortunately it looks like pslld/psrld/psrad all shift the elements by the same value. I can achieve what I want by writing one element and one shift value to another register, shifting that, and repeating this four times, but that seems really slow. It also defeats the purpose of SIMD. So does anyone know any tricks to speed this up?

I need this to convert 32-bit floating-point numbers to 16-bit floating-point numbers.

Thanks,

Nicolas

I think I found a way... I suddenly realized that shifting is the same as multiplication by a power of two. To convert my shift values to a power of two I can put them into the exponents of floating point numbers, and convert the elements to be shifted to floating-point as well. After multiplication I just convert back to integer. 8)
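The multiply trick described above can be sketched with SSE2 intrinsics. This is only an illustration under stated assumptions, not the code from the original post: the function names `shift_left_via_mul` and `mul_shift_left_4` are made up for this example, and it assumes each value satisfies 0 <= x < 2^24 and each shifted result stays below 2^31, so that the int-to-float conversion, the multiplication, and the conversion back are all exact. Outside that range the float round trip loses bits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Shift each 32-bit lane of x left by the matching lane of s, using a
   floating-point multiply. Assumption (not from the thread): 0 <= x < 2^24,
   s >= 0, and (x << s) < 2^31, so every step below is exact. */
static __m128i shift_left_via_mul(__m128i x, __m128i s)
{
    /* Build the float 2^s in each lane: the biased exponent (s + 127)
       goes into bits 30..23, the mantissa bits stay zero (paddd, pslld). */
    __m128i pow2_bits = _mm_slli_epi32(_mm_add_epi32(s, _mm_set1_epi32(127)), 23);
    __m128  pow2      = _mm_castsi128_ps(pow2_bits);  /* reinterpret, no conversion */

    __m128 xf   = _mm_cvtepi32_ps(x);    /* int -> float (cvtdq2ps) */
    __m128 prod = _mm_mul_ps(xf, pow2);  /* x * 2^s      (mulps)    */
    return _mm_cvttps_epi32(prod);       /* float -> int, truncating (cvttps2dq) */
}

/* Hypothetical array wrapper, for illustration only. */
static void mul_shift_left_4(const int32_t x[4], const int32_t s[4], int32_t out[4])
{
    __m128i r = shift_left_via_mul(_mm_loadu_si128((const __m128i *)x),
                                   _mm_loadu_si128((const __m128i *)s));
    _mm_storeu_si128((__m128i *)out, r);
}
```

For example, values {1, 3, 5, 7} with shift counts {0, 4, 8, 12} give {1, 48, 1280, 28672}. A negative count would multiply by a fraction and so act as a right shift that truncates toward zero, under the same range assumptions.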

Does anyone know whether there's any performance impact for treating integer data as floating-point? Or does it all get processed by the same execution pipelines?

hi,

Could you post the solution you described?

I use the following code for doing this job:

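; Shift doublewords left (** or right) by per-element counts
; Register layouts below are written high dword first: d3|d2|d1|d0
; xmm0   = 4 x 32-bit numbers (v3|v2|v1|v0)
; xmm1   = 4 x shift values   (s3|s2|s1|s0)
; xmm2-4 = free
; OUT: xmm2
movdqa     xmm2,xmm1
punpckhqdq xmm1,xmm0              ; xmm1 = v3|v2|s3|s2
punpcklqdq xmm2,xmm0              ; xmm2 = v1|v0|s1|s0
pshufd     xmm1,xmm1,10110100y    ; xmm1 = v2|v3|s3|s2
pshufd     xmm2,xmm2,10110100y    ; xmm2 = v0|v1|s1|s0
pshufd     xmm0,xmm1,10110001y    ; xmm0 = v3|v2|s2|s3
pshufd     xmm3,xmm2,10110001y    ; xmm3 = v1|v0|s0|s1
; create mask => |XXXX|XXXX|0000|XXXX|
pcmpeqd    xmm4,xmm4              ;-
pslldq     xmm4,4                 ; |- not needed if a memory operand is used
pshufd     xmm4,xmm4,11100001y    ;-
; clear dword 1 so the low 64 bits of each register hold only its shift count,
; then shift each register by its own low quadword
pand       xmm0,xmm4              ; pand xmm0,OWORD ptr msk
pslld      xmm0,xmm0              ; psrld **  d3 = v3 << s3
pand       xmm1,xmm4              ; ..,OWORD ptr msk
pslld      xmm1,xmm1              ; psrld **  d3 = v2 << s2
pand       xmm2,xmm4              ; ..,OWORD ptr msk
pslld      xmm2,xmm2              ; psrld **  d3 = v0 << s0
pand       xmm3,xmm4              ; ..,OWORD ptr msk
pslld      xmm3,xmm3              ; psrld **  d3 = v1 << s1
; gather dword 3 of each register; the other dwords were shifted by the
; wrong count and must not be merged into the result
punpckhdq  xmm1,xmm0              ; high qword = (v3<<s3)|(v2<<s2)
punpckhdq  xmm2,xmm3              ; high qword = (v1<<s1)|(v0<<s0)
punpckhqdq xmm2,xmm1              ; xmm2 = (v3<<s3)|(v2<<s2)|(v1<<s1)|(v0<<s0)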

regards,

qWord
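The integer-only idea in the post above (pair each value with its own count, clear the dword next to the count, then shift the register by itself) can also be written with SSE2 intrinsics. The following is a hedged sketch rather than a translation of the posted assembly: the names `shift_by_own_count`, `shift_left_per_lane`, and `lane_shift_left_4` are invented for this example, and the final merge simply gathers dword 3 of each shifted register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Clear dword 1 so the low 64 bits hold only the count in dword 0,
   then shift all lanes by that count (pand + pslld xmm,xmm). */
static __m128i shift_by_own_count(__m128i x, __m128i mask)
{
    x = _mm_and_si128(x, mask);
    return _mm_sll_epi32(x, x);  /* count is read from the low 64 bits */
}

/* Per-lane left shift: result lane i = v[i] << s[i]. */
static __m128i shift_left_per_lane(__m128i v, __m128i s)
{
    /* Keeps dwords 0, 2 and 3, clears dword 1. */
    const __m128i mask = _mm_setr_epi32(-1, 0, -1, -1);

    /* Layouts are written high dword first. */
    __m128i hi = _mm_unpackhi_epi64(s, v);  /* v3|v2|s3|s2 */
    __m128i lo = _mm_unpacklo_epi64(s, v);  /* v1|v0|s1|s0 */

    /* Put each value in dword 3 and its own count in dword 0. */
    __m128i a3 = _mm_shuffle_epi32(hi, _MM_SHUFFLE(3, 2, 0, 1));  /* v3|v2|s2|s3 */
    __m128i a2 = _mm_shuffle_epi32(hi, _MM_SHUFFLE(2, 3, 1, 0));  /* v2|v3|s3|s2 */
    __m128i a1 = _mm_shuffle_epi32(lo, _MM_SHUFFLE(3, 2, 0, 1));  /* v1|v0|s0|s1 */
    __m128i a0 = _mm_shuffle_epi32(lo, _MM_SHUFFLE(2, 3, 1, 0));  /* v0|v1|s1|s0 */

    a3 = shift_by_own_count(a3, mask);  /* dword 3 = v3 << s3 */
    a2 = shift_by_own_count(a2, mask);  /* dword 3 = v2 << s2 */
    a1 = shift_by_own_count(a1, mask);  /* dword 3 = v1 << s1 */
    a0 = shift_by_own_count(a0, mask);  /* dword 3 = v0 << s0 */

    /* Gather dword 3 of each register into the final order. */
    __m128i hi_pair = _mm_unpackhi_epi32(a2, a3);  /* r3|r2|..|.. */
    __m128i lo_pair = _mm_unpackhi_epi32(a0, a1);  /* r1|r0|..|.. */
    return _mm_unpackhi_epi64(lo_pair, hi_pair);   /* r3|r2|r1|r0 */
}

/* Hypothetical array wrapper, for illustration only. */
static void lane_shift_left_4(const uint32_t v[4], const uint32_t s[4], uint32_t out[4])
{
    __m128i r = shift_left_per_lane(_mm_loadu_si128((const __m128i *)v),
                                    _mm_loadu_si128((const __m128i *)s));
    _mm_storeu_si128((__m128i *)out, r);
}
```

Unlike the floating-point multiply, this keeps all 32 bits of each value. Swapping `_mm_sll_epi32` for `_mm_srl_epi32` or `_mm_sra_epi32` gives the logical or arithmetic right-shift variants.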

I think I found a way... I suddenly realized that shifting is the same as multiplication by a power of two

Great.

Now what will the world think when I spread this quote (out of context, of course) around the world and say you've got 200 messages on a board dedicated to assembly optimizing, and that you wrote a software rasterizer featuring D3D shaders on CPU, dynamic code, S.I.M.D., self compilation and "M.K.A.R.F.P.O. - x86"?

(

"M.K.A.R.F.P.O. - x86" :

"Major Kick-a** Roxxing-Fast Performance Ownage on x86 processorz" (c) 2007 , H.S.A.E.

(H.S.A.E. : "HelloWorld Sarcastic Acronyms Enterprises")

)

Could you post the solution you described?

I posted it in the DevMaster.net Daily Code Gem, with a practical use of the method.

Now what will the world think when I spread this quote (out of context, of course) around the world and say you've got 200 messages on a board dedicated to assembly optimizing, and that you wrote a software rasterizer featuring D3D shaders on CPU, dynamic code, S.I.M.D., self compilation and "M.K.A.R.F.P.O. - x86"?

It would make me as famous as Newton with his apple. ;)