I'm having a little arithmetic problem with some SSE2 code.

I have 16x 32-bit values in xmm0 - xmm3:

Stage0:

xmm0 = A0 B0 C0 D0
xmm1 = A1 B1 C1 D1
xmm2 = A2 B2 C2 D2
xmm3 = A3 B3 C3 D3

I now need these values in this format:

Stage1:

xmm0 = A0 A1 A2 A3
xmm1 = B0 B1 B2 B3
xmm2 = C0 C1 C2 C3
xmm3 = D0 D1 D2 D3

Then I have to do some more math, like:

Stage2:

xmm0 = A0 A1 A2 A3
     +  +  +  +
xmm1 = B0 B1 B2 B3
     +  +  +  +
xmm2 = C0 C1 C2 C3
     +  +  +  +
xmm3 = D0 D1 D2 D3

.
.
.

X0 = (A0 + B0 + C0 + D0)
X1 = (A1 + B1 + C1 + D1)
X2 = (A2 + B2 + C2 + D2)
X3 = (A3 + B3 + C3 + D3)

xmm4 = X0 X1 X2 X3

Then we have 4x 32-bit values for more work.


.... more operations

Now I'm looking for the best (fastest) way to reorder the values from Stage0 to Stage1, or a way to do the additions while reordering, going directly to Stage2.

Any tips?

thx
Posted on 2004-06-11 06:25:14 by Andy2222
shufps?
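For illustration, a sketch of what that suggestion could look like (the classic shufps 4x4 transpose, untested here; note that shufps is a float-domain instruction, so some CPUs charge a bypass penalty when it is used on integer data, although the bit patterns pass through unchanged):

movaps xmm4, xmm0       ; copy, xmm4 = D0, C0, B0, A0
movaps xmm5, xmm2       ; copy, xmm5 = D2, C2, B2, A2
shufps xmm0, xmm1, 44h  ; xmm0 = B1, A1, B0, A0
shufps xmm4, xmm1, 0EEh ; xmm4 = D1, C1, D0, C0
shufps xmm2, xmm3, 44h  ; xmm2 = B3, A3, B2, A2
shufps xmm5, xmm3, 0EEh ; xmm5 = D3, C3, D2, C2
movaps xmm1, xmm0
shufps xmm0, xmm2, 88h  ; xmm0 = A3, A2, A1, A0
shufps xmm1, xmm2, 0DDh ; xmm1 = B3, B2, B1, B0
movaps xmm2, xmm4
shufps xmm2, xmm5, 88h  ; xmm2 = C3, C2, C1, C0
shufps xmm4, xmm5, 0DDh ; xmm4 = D3, D2, D1, D0
movaps xmm3, xmm4       ; xmm3 = D3, D2, D1, D0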
Posted on 2004-06-11 09:27:55 by Scali
Perhaps this is of some help:

The fastest way I know to add up horizontally is like this:

Suppose you have A B C D in xmm0 and you want to get A+B+C+D



movdqa xmm7,xmm0 ; A B C D
pshufd xmm7,xmm7,01001110b ; C D A B
paddd xmm0,xmm7 ; A+C B+D C+A D+B
movdqa xmm7,xmm0 ; A+C B+D C+A D+B
pshufd xmm7,xmm7,00010001b ; D+B C+A D+B C+A
paddd xmm0,xmm7 ; A+C+D+B B+D+C+A C+A+D+B D+B+C+A


xmm7 is used as temporary storage only.

So this way you can get X0 etc. straight away, and you end up with this:

xmm0 = X0 X0 X0 X0
xmm1 = X1 X1 X1 X1
xmm2 = X2 X2 X2 X2
xmm3 = X3 X3 X3 X3

Now to get these into xmm4...



movss xmm4,xmm3 ; .. .. .. X3
pslldq xmm4,4 ; .. .. X3 ..
movss xmm4,xmm2 ; .. .. X3 X2
pslldq xmm4,4 ; .. X3 X2 ..
movss xmm4,xmm1 ; .. X3 X2 X1
pslldq xmm4,4 ; X3 X2 X1 ..
movss xmm4,xmm0 ; X3 X2 X1 X0

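Since the horizontal add leaves the sum in every dword, an alternative sketch (also untested, and again using the float-domain shufps on integer data) packs the four results with three shuffles, leaving them in xmm0 instead of xmm4:

shufps xmm0, xmm1, 00000000b ; X1 X1 X0 X0
shufps xmm2, xmm3, 00000000b ; X3 X3 X2 X2
shufps xmm0, xmm2, 10001000b ; X3 X2 X1 X0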

Hope that helps :)
Posted on 2004-06-11 19:28:31 by stormix
Mhh, thx, I had something similar in mind, but I'm still not happy with this solution.

The pshufd instruction is vectorized and takes 4-6 cycles on an AMD.

It's making me crazy that I can't find a better solution. However I do the math beforehand, I always end up with 4 values (a, b, c, d) packed in 1 xmm register, and I need their sum...

I can't use SSE3...
Posted on 2004-06-18 16:55:06 by Andy2222
Hi

You do NOT need the two "movdqa xmm7,xmm0" instructions: pshufd can do it directly, IIRC.

MMX pshufw allows this: pshufw mm7, mm0, imm8.
So IIRC the same works with the XMM version, pshufd.
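If so, the sequence above shrinks to something like this (a sketch, untested):

pshufd xmm7, xmm0, 01001110b ; C D A B, straight from xmm0
paddd xmm0, xmm7             ; A+C B+D C+A D+B
pshufd xmm7, xmm0, 00010001b ; B+D A+C B+D A+C
paddd xmm0, xmm7             ; A+B+C+D in every dword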
Posted on 2004-06-21 05:59:43 by valy
Uses SSE2 (P4 or AMD64), so some versions of MASM may have trouble...


movdqa xmm7, xmm0 ; XMM7 = D0, C0, B0, A0
punpckldq xmm0, xmm1 ; XMM0 = B1, B0, A1, A0
punpckhdq xmm7, xmm1 ; XMM7 = D1, D0, C1, C0

movdqa xmm1, xmm2 ; XMM1 = D2, C2, B2, A2
punpckldq xmm1, xmm3 ; XMM1 = B3, B2, A3, A2
punpckhdq xmm2, xmm3 ; XMM2 = D3, D2, C3, C2

movdqa xmm3, xmm7 ; XMM3 = D1, D0, C1, C0
punpckhqdq xmm3, xmm2 ; XMM3 = D3, D2, D1, D0
punpcklqdq xmm7, xmm2 ; XMM7 = C3, C2, C1, C0

movdqa xmm2, xmm0 ; XMM2 = B1, B0, A1, A0
punpcklqdq xmm0, xmm1 ; XMM0 = A3, A2, A1, A0
punpckhqdq xmm2, xmm1 ; XMM2 = B3, B2, B1, B0

; Conditional assembly
IF WANT_TABLE
; Table XMM0..3,
; XMM0 = A3, A2, A1, A0
; XMM1 = B3, B2, B1, B0
; XMM2 = C3, C2, C1, C0
; XMM3 = D3, D2, D1, D0
; XMM7 = (D3 + C3 + B3 + A3), (D2 + C2 + B2 + A2), (D1 + C1 + B1 + A1), (D0 + C0 + B0 + A0)

movdqa xmm1, xmm2
movdqa xmm2, xmm7

paddd xmm7, xmm3
paddd xmm7, xmm1
paddd xmm7, xmm0

ELSE
; 0 = As, 2 = Bs, 7 = Cs, 3 = Ds
paddd xmm7, xmm3 ; XMM7 = (C3 + D3), (C2 + D2), (C1 + D1), (C0 + D0)
paddd xmm0, xmm2 ; XMM0 = (A3 + B3), (A2 + B2), (A1 + B1), (A0 + B0)
paddd xmm0, xmm7

; XMM0 = (D3 + C3 + B3 + A3), (D2 + C2 + B2 + A2), (D1 + C1 + B1 + A1), (D0 + C0 + B0 + A0)
ENDIF


Mirno
Posted on 2004-06-21 12:13:02 by Mirno
Ah cool, a nice solution without vectorized instructions :) thx

BTW, does someone know what latency the new hadd... (horizontal add) instructions from SSE3 have? Are they useful, or are they also vectorized and stall the pipeline?
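For float data it would collapse the whole Stage0-to-Stage2 job into something like this (a sketch only; haddps works on packed floats, so it would not apply directly to the integer values above):

haddps xmm0, xmm1 ; C1+D1, A1+B1, C0+D0, A0+B0
haddps xmm2, xmm3 ; C3+D3, A3+B3, C2+D2, A2+B2
haddps xmm0, xmm2 ; X3, X2, X1, X0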
Posted on 2004-06-21 12:56:04 by Andy2222