Either MOVMSKPS (Extract Packed Single-Precision Floating-Point Sign Mask) isn't following the specification, or I am misunderstanding something. When I use it, it appears to reverse the order of the bits it extracts from the XMM register before putting them in the low-order bits of the general-purpose register. The Intel manual and every source I can find online say that the order should be preserved, not reversed, as shown in this excerpt:
DEST[0] ← SRC[31];
DEST[1] ← SRC[63];
DEST[2] ← SRC[95];
DEST[3] ← SRC[127];
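As a rough illustration of the pseudo-code above, here is a minimal Python sketch that treats the XMM register as a 128-bit integer, with bit 0 as the least significant bit. The function name `movmskps` is hypothetical and only models the documented bit mapping; it is not taken from any library.

```python
def movmskps(xmm):
    """Model of the documented mapping: result bit i = source bit 32*i + 31."""
    mask = 0
    for lane in range(4):
        sign = (xmm >> (32 * lane + 31)) & 1  # top bit of each 32-bit lane
        mask |= sign << lane
    return mask

# Sign bit set only in the lowest lane (bits 0..31 of the register).
print(hex(movmskps(1 << 31)))   # 0x1
# Sign bit set only in the highest lane (bits 96..127 of the register).
print(hex(movmskps(1 << 127)))  # 0x8
```

In this model, the lane that occupies the least significant 32 bits of the register controls bit 0 of the result.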
To illustrate the reversed ordering I'm seeing, here is some sample code:
section .data   align=16
      align  16
fzzz    db      0xFF,0xFF,0xFF,0xFF,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00
section .bss
section .text
      global  main
main:
      movdqa  xmm0,
      movmskps        eax,xmm0
      mov    eax,1
      mov    ebx,0
      int    80h
What I use in the terminal in Linux (Ubuntu 9.10) to assemble, link, and debug it:
nasm -f elf -g -l movmskps.lst movmskps.asm
gcc -g -o movmskps.out movmskps.o
gdb movmskps.out
And the debugger's output:
8               movdqa  xmm0,
(gdb) x/16xb &fzzz
0x804a020 <fzzz>:      0xff    0xff    0xff    0xff    0x00    0x00    0x00    0x00
0x804a028:      0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
(gdb) next
9              movmskps        eax,xmm0
(gdb) print/x $xmm0
$1 = {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0xff, 0xff, 0xff, 0xff, 0x0 <repeats 12 times>},
    v8_int16 = {0xffff, 0xffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0xffffffff, 0x0, 0x0, 0x0},
    v2_int64 = {0xffffffff, 0x0}, uint128 = 0x000000000000000000000000ffffffff}
(gdb) next
10              mov    eax,1
(gdb) print/x $eax
$2 = 0x1
I've done some further testing, and confirmed that the order of the mask bits is consistently reversed before being placed in the general purpose register.

I can work with it like this if it reliably continues to behave this way, but I'd like to make sure this is what's supposed to be happening. I imagine I have a little-endian problem in my understanding somewhere, but I've more than triple-checked, and it really does look like either the Intel manual or my processor is wrong :shock: (it's an Intel Core 2 Duo E8400 Wolfdale).
Posted on 2010-01-14 14:12:26 by pgn674
Endian "changes" on the x86 tend to happen when transferring data between memory and registers, so I would look closer at what movdqa is really doing.

In fact, within the following...


(gdb) print/x $xmm0
$1 = {v4_float = {0x0, 0x0, 0x0, 0x0}, v2_double = {0x0, 0x0}, v16_int8 = {0xff, 0xff, 0xff, 0xff, 0x0 <repeats 12 times>},
    v8_int16 = {0xffff, 0xffff, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, v4_int32 = {0xffffffff, 0x0, 0x0, 0x0},
    v2_int64 = {0xffffffff, 0x0}, uint128 = 0x000000000000000000000000ffffffff}


uint128 would suggest that movdqa is indeed reversing the byte order, and this sounds consistent for memory/register data transfer operations on the x86, especially since you are using successive db's instead of something like do.
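To make the two views of the register concrete, here is a small Python sketch, assuming that a 16-byte load places the byte at the lowest address into bits 0..7 of the register (little-endian). The helpers `as_uint128` and `as_v16_int8` are hypothetical names that only imitate how the GDB fields quoted above appear to be laid out.

```python
# Bytes as laid out in memory by the successive db's, lowest address first.
memory = bytes([0xFF, 0xFF, 0xFF, 0xFF] + [0x00] * 12)

# Assumed little-endian load: lowest-address byte becomes bits 0..7.
register = int.from_bytes(memory, "little")

def as_uint128(value):
    """Format as one 128-bit number, most significant digit on the left."""
    return f"0x{value:032x}"

def as_v16_int8(value):
    """List the 16 bytes with element 0 being the least significant byte."""
    return [(value >> (8 * i)) & 0xFF for i in range(16)]

print(as_uint128(register))   # 0x000000000000000000000000ffffffff
print(as_v16_int8(register))  # [255, 255, 255, 255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Under that assumption both views describe the same value: the four 0xFF bytes sit in the least significant lane, so SRC[31] is the only sign bit that is set.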
Posted on 2010-01-14 15:56:27 by SpooK
I was considering that too, but look at what v16_int8 shows. Seeing that, I was thinking that when GDB prints an XMM register, it might format uint128 to match how it would appear if you placed it back in memory, due to little-endianness. I might be wrong there, though.

I'll try some more tests to see if I can figure out the exact operation and endianness of MOVDQA when I get back to my machine. If it really is little-endian and I've been reading GDB's "print/x $xmm0" incorrectly all this time, then I have to question how on earth the rest of my program works.
Posted on 2010-01-14 16:14:39 by pgn674

If it really is little-endian and I've been reading GDB's "print/x $xmm0" incorrectly all this time, then I have to question how on earth the rest of my program works.


Based on examples found here and here, and the explanation here, I would say that this is the case.
Posted on 2010-01-14 16:49:55 by SpooK
But in the end it's just PEBKAC; it happens to the best of us (so I've been told :D).
I prefer good old-fashioned statements everywhere for debugging.


jmp start
align 16
make0001  dd -1.0, 0.0, 0.0, 0.0 ;; -1.0 is loaded into XMM0[0-31]
make1000  dd 0.0, 0.0, 0.0, -1.0 ;; -1.0 is loaded into XMM0[96-127]
start:

        MOVDQA          xmm0, dqword
        MOVMSKPS        eax, xmm0
        push            eax
        push            szFormat
        call            ;; "1" will be printed
        MOVDQA          xmm0, dqword
        MOVMSKPS        eax, xmm0
        push            eax
        push            szFormat
        call            ;; "8" will be printed

        push            _pause
        call           
        push            0
        call           
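The two data lines in the listing above can be checked without an assembler. This Python sketch assumes that `dd` stores each value as a little-endian IEEE 754 single, and the helper name `sign_mask` is hypothetical.

```python
import struct

def sign_mask(floats):
    """Pack four floats as little-endian singles and collect their sign bits."""
    raw = struct.pack("<4f", *floats)      # 16 bytes, first float at lowest address
    lanes = struct.unpack("<4I", raw)      # the same bytes viewed as four 32-bit lanes
    return sum(((lane >> 31) & 1) << i for i, lane in enumerate(lanes))

print(sign_mask([-1.0, 0.0, 0.0, 0.0]))  # 1
print(sign_mask([0.0, 0.0, 0.0, -1.0]))  # 8
```

The first value in each `dd` list lands in the lowest lane, which is why the two arrays yield 1 and 8 respectively.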

Posted on 2010-01-15 12:29:55 by r22
OK, I did a bit of testing, and I was indeed reading the debugger's output incorrectly. Instead of looking at "print/x $xmm0"'s v16_int8, I should have been looking at its uint128. Also, little-endian byte ordering does apply to XMM registers. I'll probably have to watch out for whether the little-endianness reverses the order of all 16 bytes as one chunk, or of smaller chunks, depending on what kind of data the instruction I'm using thinks it's dealing with.

To help me test, I used MOVLPD (Move Low Packed Double-Precision Floating-Point Value) and PEXTRB (Extract Byte). I learned that the low order bytes are on the right side of what uint128 shows, and that an offset starts from the right side of what uint128 shows. I'm all good now; thank you for your help.
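The offset rule described above can be sketched in Python by modeling the register as a 128-bit integer. The function name `pextrb` is a hypothetical stand-in for the instruction, assuming the byte index counts up from the least significant byte and that only the low four bits of the index are used.

```python
def pextrb(xmm, index):
    """Return byte `index` of a 128-bit value, byte 0 being bits 0..7."""
    return (xmm >> (8 * (index & 0xF))) & 0xFF

# Each byte holds its own index, so the rightmost hex pair is byte 0.
xmm0 = 0x0F0E0D0C0B0A09080706050403020100

print(hex(pextrb(xmm0, 0)))   # 0x0
print(hex(pextrb(xmm0, 15)))  # 0xf
```

Written as a single hexadecimal number, offset 0 is the rightmost byte and offset 15 the leftmost.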
Posted on 2010-01-29 01:16:12 by pgn674