Hi all,

I have an SSE register with 4 x 32-bit numbers, and I'd like to shift them by four different values in another SSE register.

Unfortunately it looks like pslld/psrld/psrad all shift the elements by the same value. I can achieve what I want by writing one element and one shift value to another register, shifting that, and repeating this four times, but that seems really slow. It also defeats the purpose of SIMD. So does anyone know any tricks to speed this up?

I need this to convert 32-bit floating-point numbers to 16-bit floating-point numbers.

Thanks,

Nicolas
Posted on 2007-09-17 02:24:42 by C0D1F1ED
I think I found a way... I suddenly realized that shifting is the same as multiplication by a power of two. To convert my shift values to a power of two I can put them into the exponents of floating point numbers, and convert the elements to be shifted to floating-point as well. After multiplication I just convert back to integer. 8)
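
A minimal scalar sketch of that idea in C, one lane at a time, so the arithmetic can be checked without SSE. The function names `pow2_from_exponent` and `shift_left_per_lane` are made up for illustration. It assumes IEEE-754 single precision, non-negative inputs below 2^24 (so the int-to-float conversion is exact), and results that still fit in a signed 32-bit integer. The comments name the SSE2 instructions that would plausibly do each step on all four lanes at once.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the float 2^s by writing the biased exponent (127 + s) into
   bits 23..30 of an IEEE-754 single; sign and mantissa stay zero.
   Valid for -126 <= s <= 127. In SSE2 terms this is roughly a paddd
   followed by a pslld with a constant count of 23. */
float pow2_from_exponent(int32_t s)
{
    uint32_t bits = (uint32_t)(127 + s) << 23;
    float f;
    memcpy(&f, &bits, sizeof f);   /* reinterpret the bits, no conversion */
    return f;
}

/* Shift each of the four lanes of v by its own count in s.
   A positive s[i] shifts left; a negative s[i] scales by 2^-|s[i]|,
   which acts as a right shift that truncates toward zero.
   Assumes 0 <= v[i] < 2^24 and that every result fits in int32_t. */
void shift_left_per_lane(const int32_t v[4], const int32_t s[4],
                         int32_t out[4])
{
    for (int i = 0; i < 4; ++i) {
        assert(v[i] >= 0 && v[i] < (1 << 24));
        /* int -> float, then multiply by 2^s (cvtdq2ps + mulps) */
        float scaled = (float)v[i] * pow2_from_exponent(s[i]);
        /* float -> int with truncation (cvttps2dq) */
        out[i] = (int32_t)scaled;
    }
}
```

Outside those assumptions the trick is lossy: a float only carries 24 significant bits, so larger inputs get rounded during the conversion to floating-point.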

Does anyone know whether there's any performance impact for treating integer data as floating-point? Or does it all get processed by the same execution pipelines?
Posted on 2007-09-18 01:53:13 by C0D1F1ED
hi,

could you post the solution you described?
I use the following code for doing this job:

  ;shift doublewords left (** or right)
  ;xmm0  = 4 x 32-bit numbers
  ;xmm1  = 4 x shift-values
  ;xmm2-4 = free (xmm4 only if the mask is built in a register)
  ;OUT: xmm2
  movdqa xmm3,xmm1
  punpckhqdq xmm1,xmm0
  punpcklqdq xmm3,xmm0
  pshufd xmm1,xmm1,10110100y
  pshufd xmm3,xmm3,10110100y
  pshufd xmm0,xmm1,010110001y
  pshufd xmm2,xmm3,010110001y
      ;create msk. => |XXXX|XXXX|0000|XXXX|
      pcmpeqd xmm4,xmm4                ;-
      pslldq xmm4,4                     ; |- not needed if mem.-operand is used
      pshufd xmm4,xmm4,11100001y        ;-
  pand xmm0,xmm4                        ; pand xmm0,OWORD ptr msk
  pslld xmm0,xmm0 ;psrld **
  pand xmm1,xmm4                        ; ..,OWORD ptr msk
  pslld xmm1,xmm1 ;psrld **
  pand xmm2,xmm4                        ; ..,OWORD ptr msk
  pslld xmm2,xmm2 ;psrld **
  pand xmm3,xmm4                        ; ..,OWORD ptr msk
  pslld xmm3,xmm3 ;psrld **
  psrldq xmm1,4
  psrldq xmm3,4
  por xmm0,xmm1
  por xmm2,xmm3
  punpckhqdq xmm2,xmm0


regards,
qWord
Posted on 2007-09-18 18:14:54 by qWord
> I think I found a way... I suddenly realized that shifting is the same as multiplication by a power of two


Great.

Now what will the world think when I spread this quote (out of context, of course) around the world and say you've got 200 messages on a board dedicated to assembly optimizing, and that you wrote a software rasterizer featuring D3D shaders on CPU, dynamic code, S.I.M.D., self compilation and "M.K.A.R.F.P.O. - x86"?

(
"M.K.A.R.F.P.O. - x86" :
"Major Kick-a** Roxxing-Fast Performance Ownage on x86 processorz" (c) 2007, H.S.A.E.

(H.S.A.E. : "HelloWorld Sarcastic Acronyms Enterprises")

)

Posted on 2007-09-25 04:38:49 by HeLLoWorld

> could you post the solution you described?

I posted it in the DevMaster.net Daily Code Gem, with a practical use of the method.

> Now what will the world think when I spread this quote (out of context, of course) around the world and say you've got 200 messages on a board dedicated to assembly optimizing, and that you wrote a software rasterizer featuring D3D shaders on CPU, dynamic code, S.I.M.D., self compilation and "M.K.A.R.F.P.O. - x86"?

It would make me as famous as Newton with his apple. ;)
Posted on 2007-09-25 07:29:46 by C0D1F1ED