byteswapping xmm / sse2 registers (without using BSWAP)
aka, switching little endian to big endian and back.
Intro:
SSE2 has no BSWAP instruction for the 128-bit XMM registers. In fact, SSE2 has no
way to shuffle bytes directly (PSHUFB only arrived later, with SSSE3). You can shuffle
quadwords, doublewords, and words, but not bytes. What if you have four 32-bit values
in an XMM register that you want to BSWAP?
You could copy each value out to a 32-bit general-purpose register, BSWAP it, and copy
it back into the 128-bit register. However, you can also do it another way, without using
any general-purpose registers or BSWAP. Instead you can use the SSE2 unpack, shuffle-words,
and pack instructions.
So: given one XMM register (xmm5 here), swap the bytes within each of the four 32-bit doublewords inside it.
This uses two temporary registers (xmm0 and xmm1 here).
movdqa xmm0, xmm5 ; copy the input into both temporaries
movdqa xmm1, xmm5
pxor xmm5, xmm5 ; xmm5 = 0, the filler for the interleave
punpckhbw xmm0, xmm5 ; interleave '0' with the high 8 bytes of the original
punpcklbw xmm1, xmm5 ; and with the low 8 bytes, so they become words
pshuflw xmm0, xmm0, 0b00_01_10_11 ; swap the words by shuffling (reverse each group of 4 words)
pshufhw xmm0, xmm0, 0b00_01_10_11
pshuflw xmm1, xmm1, 0b00_01_10_11
pshufhw xmm1, xmm1, 0b00_01_10_11
packuswb xmm1, xmm0 ; pack/de-interleave, i.e. make the words back into bytes
movdqa xmm5, xmm1 ; result back in xmm5
how it works
In SSE2 you can't swap bytes within an XMM register directly, but you can...
1. 'inflate' bytes into words, by interleaving with 0
2. swap words
3. 'deflate' words back into bytes, chopping off the 0
EX:
input 16 bytes / 128-bits:
input register bytes: ABCD EFGH IJKL MNOP
inflate / unpack / interleave with 0: (PUNPCKHBW, PUNPCKLBW)
temp register 1: 0A0B 0C0D 0E0F 0G0H
temp register 2: 0I0J 0K0L 0M0N 0O0P
swap words: (PSHUFLW, PSHUFHW)
temp register 1: 0D0C 0B0A 0H0G 0F0E
temp register 2: 0L0K 0J0I 0P0O 0N0M
deflate / pack / de-interleave (PACKUSWB; every word is 0..255 here, so its unsigned saturation never changes a value)
output register bytes: DCBA HGFE LKJI PONM
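For readers who work in C rather than NASM, the inflate / swap / deflate steps above can be sketched with SSE2 intrinsics. This is only an illustration, not the original poster's code: the function names bswap32x4_unpack and bswap32x4_unpack_bytes are made up for this example, and it assumes an x86 compiler that provides <emmintrin.h>.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 / x86-64 target) */
#include <stdint.h>

/* Hypothetical helper: byte-swap each of the four 32-bit lanes of v
   using only unpack, word shuffles, and pack. */
static __m128i bswap32x4_unpack(__m128i v)
{
    const __m128i zero = _mm_setzero_si128();

    /* 'inflate': interleave each byte with 0 so it becomes a 16-bit word */
    __m128i hi = _mm_unpackhi_epi8(v, zero);   /* PUNPCKHBW: bytes 8..15 */
    __m128i lo = _mm_unpacklo_epi8(v, zero);   /* PUNPCKLBW: bytes 0..7  */

    /* swap words: 0x1B == 0b00_01_10_11 reverses each group of 4 words */
    hi = _mm_shufflelo_epi16(hi, 0x1B);        /* PSHUFLW */
    hi = _mm_shufflehi_epi16(hi, 0x1B);        /* PSHUFHW */
    lo = _mm_shufflelo_epi16(lo, 0x1B);
    lo = _mm_shufflehi_epi16(lo, 0x1B);

    /* 'deflate': pack the words back into bytes (PACKUSWB);
       all words are 0..255, so saturation never alters them */
    return _mm_packus_epi16(lo, hi);
}

/* Hypothetical convenience wrapper working on plain byte arrays. */
static void bswap32x4_unpack_bytes(const uint8_t in[16], uint8_t out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, bswap32x4_unpack(v));
}
```

Feeding the 16 bytes "ABCDEFGHIJKLMNOP" through bswap32x4_unpack_bytes should give "DCBAHGFELKJIPONM", matching the example above.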
Bonus:
If you also want to reverse the order of the four doublewords within the 128-bit register
(giving a full 16-byte reversal), add one PSHUFD with the same 0b00_01_10_11 immediate.
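A minimal C sketch of that PSHUFD step on its own, again assuming an x86 compiler with <emmintrin.h>. The names reverse_dwords and reverse_dwords_bytes are invented for this illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 / x86-64 target) */
#include <stdint.h>

/* Hypothetical helper: reverse the order of the four 32-bit doublewords.
   0x1B == 0b00_01_10_11 selects source dwords 3,2,1,0 for results 0,1,2,3. */
static __m128i reverse_dwords(__m128i v)
{
    return _mm_shuffle_epi32(v, 0x1B);         /* PSHUFD */
}

/* Hypothetical convenience wrapper working on plain byte arrays. */
static void reverse_dwords_bytes(const uint8_t in[16], uint8_t out[16])
{
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)out, reverse_dwords(v));
}
```

With the input bytes "ABCDEFGHIJKLMNOP" this should produce "MNOPIJKLEFGHABCD": the doublewords change places, but the bytes inside each one keep their order. Combined with a per-doubleword byteswap, the result is a complete 16-byte reversal.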
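NASM MACRO:
; xmmbswap target, temp1, temp2
; byteswaps the four doublewords in %1; clobbers %2 and %3.
; all three arguments must be distinct XMM registers, e.g. xmmbswap xmm5, xmm1, xmm0
%macro xmmbswap 3
movdqa %3, %1 ; copy the input into both temporaries
movdqa %2, %1
pxor %1, %1 ; %1 = 0, the filler for the interleave
punpckhbw %3, %1 ; interleave '0' with the high 8 bytes of the original
punpcklbw %2, %1 ; and with the low 8 bytes, so they become words
pshuflw %3, %3, 0b00_01_10_11 ; swap the words by shuffling
pshufhw %3, %3, 0b00_01_10_11
pshuflw %2, %2, 0b00_01_10_11
pshufhw %2, %2, 0b00_01_10_11
packuswb %2, %3 ; pack/de-interleave, i.e. make the words back into bytes
movdqa %1, %2 ; result back in %1
%endmacro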
END
I took the liberty of splitting your post into multiple text/code parts for easier reading - hope you don't mind :)
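How about:
8)
pshufd xmm5,xmm5,00011011b ; reverse the order of the four doublewords (the 'bonus' step; drop it to keep them in place)
pshuflw xmm5,xmm5,10110001b ; swap the two words within each doubleword
pshufhw xmm5,xmm5,10110001b
movdqa xmm0,xmm5
psrlw xmm0,8 ; high byte of each word moves down
psllw xmm5,8 ; low byte of each word moves up
por xmm5,xmm0 ; recombine: the bytes within each word are now swapped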
8)
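The shift-and-OR variant in this reply can be sketched in C the same way. This is an illustrative translation under the same assumptions as before (x86 compiler, <emmintrin.h>). The names bswap32x4_shift, reverse16_shift, and apply16 are hypothetical. The first function leaves out the PSHUFD so the doublewords stay in place, and the second keeps it, as the reply does.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (assumes an x86 / x86-64 target) */
#include <stdint.h>

/* Hypothetical helper: byte-swap each 32-bit lane with shuffles and shifts.
   Needs one temporary instead of two. */
static __m128i bswap32x4_shift(__m128i v)
{
    /* 0xB1 == 10110001b swaps the two words inside each doubleword */
    v = _mm_shufflelo_epi16(v, 0xB1);          /* PSHUFLW */
    v = _mm_shufflehi_epi16(v, 0xB1);          /* PSHUFHW */
    /* swap the two bytes inside each word: (w << 8) | (w >> 8) */
    return _mm_or_si128(_mm_slli_epi16(v, 8),  /* PSLLW */
                        _mm_srli_epi16(v, 8)); /* PSRLW, then POR */
}

/* Hypothetical helper: same, preceded by the PSHUFD from the reply,
   which also reverses the doubleword order (full 16-byte reversal). */
static __m128i reverse16_shift(__m128i v)
{
    return bswap32x4_shift(_mm_shuffle_epi32(v, 0x1B));  /* PSHUFD first */
}

/* Hypothetical convenience wrapper: apply f to a plain 16-byte array. */
static void apply16(__m128i (*f)(__m128i), const uint8_t in[16], uint8_t out[16])
{
    _mm_storeu_si128((__m128i *)out, f(_mm_loadu_si128((const __m128i *)in)));
}
```

For the bytes "ABCDEFGHIJKLMNOP", bswap32x4_shift should give "DCBAHGFELKJIPONM" and reverse16_shift should give "PONMLKJIHGFEDCBA".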