I've just started using SSE but quickly ran into a problem... How can I "mov" the value of a 128-bit register to a string?

FireToCelc proc x:DWORD
; formula: (5*(x-32))/9
; or: (0.5555555555555556)*(x-32)
    movss xmm0, x
    movss xmm1, _32
    subss xmm0, xmm1 ; xmm0 = x - 32
    movss xmm1, _5o9
    mulss xmm0, xmm1 ; xmm0 = (0.5555555555555556)*(x-32)

    ret
FireToCelc endp


Is there some kind of TwordToAscii function?
Posted on 2005-12-04 12:32:55 by Lenin
C library (msvcrt.dll) has the sprintf function. Calling it with "%f" (or "%+.1f" if you want the temperature) format is the fastest method. Alternatively, you can write your own "float2ascii".
Posted on 2005-12-04 12:44:31 by ti_mo_n
Using:
    invoke wsprintf, addr buffer, addr lpFmt, xmm0


Gives me an error.... Do I need to first convert the 128bit register into a 32bit one?
Posted on 2005-12-04 13:21:23 by Lenin
not user32.dll's wsprintf, but msvcrt.dll's sprintf. wsprintf doesn't support floating point values.

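sprintf should be called exactly the same way as wsprintf, but you CAN'T supply an XMM register as the operand, because it's both 128-bit and packed. Store the result somewhere in memory and supply that value as the third parameter. Note that "%f" expects a 64-bit double (REAL8), so widen the single-precision value before passing it.

something like:

movss result, xmm0   ; result is a REAL4 variable
fld result           ; load the single-precision value
fstp result8         ; store it as a double (result8 is a REAL8 variable, name chosen for this example)
invoke sprintf, addr buffer, addr lpFmt, result8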


and make sure that the buffer is large enough to hold the string. Otherwise you may get a buffer overrun, which may lead to a Denial of Service. I always make the buffers 256 bytes long ^^"
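For comparison, here is roughly what that call looks like from C. This is only a sketch: `format_celsius` and its parameters are names made up for the illustration, and it uses `snprintf` instead of `sprintf` so that the buffer size is passed explicitly. A `float` passed to a printf-style function is promoted to a 64-bit `double`, which is what "%f" expects.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: converts Fahrenheit to Celsius and formats the
   result with an explicit sign and one decimal place ("%+.1f"). */
int format_celsius(char *buf, size_t size, float fahrenheit)
{
    float celsius = (fahrenheit - 32.0f) * (5.0f / 9.0f);

    /* The float argument is promoted to double by the variadic call.
       snprintf never writes more than `size` bytes, terminator included. */
    return snprintf(buf, size, "%+.1f", celsius);
}
```

With a 256-byte buffer as suggested above, `format_celsius(buffer, sizeof buffer, 212.0f)` leaves "+100.0" in the buffer.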
Posted on 2005-12-04 13:25:28 by ti_mo_n
Oh  :P Still, I don't know how to link to msvcrt.dll ... I searched through masm32's include and lib folders but couldn't find anything....
Posted on 2005-12-04 14:01:35 by Lenin
the include file isn't really necessary (at least in TASM ^^"). As for the lib: use something like "implib" to make LIBs from DLLs. implib produces LIBs for TASM.

That's why I use(d) TASM: much less trouble :P ;)

This might be useful if you opt to write your own float2ascii: > * <
Posted on 2005-12-04 14:09:33 by ti_mo_n
Would it be something like this then?

FloatToAscii proc float:QWORD, lpOut:DWORD
    LOCAL temp:DWORD, temp2:DWORD
.data
    Milion dd 1000000
.code
    ; turn to truncation mode?
    finit
    fld float
    fist temp
    fsub temp
    ; turn to round-to-nearest-integer mode?
    fmul Milion
    fistp temp2
    ; now represent it as "temp . temp2"
    ret
FloatToAscii endp
Posted on 2005-12-04 17:17:48 by Lenin
Store million as 1000000.0 (floating point value). finit initializes the FPU, so you should do finit first and change the rounding mode AFTER it. A good habit is to do one "finit" at the beginning of your program if you plan to use the FPU. No more finits are required (unless you're doing very funny things ^^").

1. enable truncation:
fstcw [controlword]
or [controlword], 0300h
fldcw [controlword]


2. enable rounding to nearest integer:
fstcw [controlword]
and [controlword], NOT 0300h ; 1111110011111111 (0FCFFh)
fldcw [controlword]


controlword is a 16-bit (word) variable

You can also write this proc using the SSE.

After this proc you have 2 integers, so use user32.dll's 'wsprintf' function like this:

invoke wsprintf, addr buffer, addr format, temp, temp2

1. "buffer" must be large enough to store the string
2. "format" must be "%d.%d" (note the decimal point between each %d. you can use comma instead of dot)

after all of this you can simply do
invoke MessageBox, 0, addr buffer, 0, 0

:)
Posted on 2005-12-04 17:43:34 by ti_mo_n
Thanks a lot for your help :) Still I'm getting some weird results...

Here for example:
.data
    _5o9 REAL4 0.5555555555555556f
.code
    invoke FloatToAscii, _5o9, addr buffer
    invoke MessageBox, eax, 0, MB_OK ; gives me 1.555556 , almost right


But when I use it with my procedure....

FireToCelc proc x:DWORD
; formula: (0.5555555555555556)*(x-32)
    movss xmm0, x
    movss xmm1, _32
    subss xmm0, xmm1 ; xmm0 = x - 32
    movss xmm1, _5o9
    mulss xmm0, xmm1 ; xmm0 = (0.5555555555555556)*(x-32)
    movss result, xmm0
    invoke FloatToAscii, result, addr buffer
    ret
FireToCelc endp

FloatToAscii proc float:DWORD, lpOut:DWORD
    LOCAL temp:DWORD, temp2:DWORD, cWord:WORD
.data
    format db "%d.%d",0
    Milion REAL4 1000000.0
.code
    ; turn to truncation mode
    fstcw cWord
    or cWord, 0300h
    fldcw cWord

    fld float
    fist temp
    fsub temp
   
    ; turn to round-to-nearest-integer mode
    fstcw cWord
    and cWord, not 0300h
    fldcw cWord
   
    fmul Milion
    fistp temp2
    invoke wsprintf, lpOut, addr format, temp, temp2
    mov eax, lpOut
    ret
FloatToAscii endp


No matter what value I enter, I always get -18.-2147483648....
Posted on 2005-12-04 19:35:31 by Lenin
oops ^^" it's not 0300h but 0C00h (and "NOT 0C00h" instead of "NOT 0300h") :)  "0300h" and "NOT 0300h" switch between 64-bit precision and 32-bit precision, respectively.

and do "fisub temp" instead of "fsub temp".

now you should get 0.555556 from 0.555555555555555555555.

remember NOT to multiply by more than 1'000'000'000, otherwise the result may not fit in a signed 32-bit variable (fistp stores a signed integer, so the maximum is 2'147'483'647, i.e. 07FFFFFFFh). if you want more than 9 places, then repeat the steps (fistp->fisub->fmul) to obtain them.

as for the second bug: it must be somewhere inside the SSE function. confirm that the result is correct before calling FloatToAscii.

/edit

the SSE function is correct. 'FloatToAscii' should not end with mov eax, lpOut :) add "mov , temp" and "mov , temp2"  or whatever. you're not returning/storing the results, so they get lost.
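The 0300h versus 0C00h mix-up is easier to see when the two fields of the control word are decoded separately. The following is a plain C model of the bit arithmetic only (it does not touch the real FPU); the function and macro names are invented for this sketch, and 037Fh is the value the control word has after finit.

```c
#include <stdint.h>

#define FPU_CW_DEFAULT 0x037Fu  /* control word after finit */
#define FPU_PC_MASK    0x0300u  /* bits 8-9: precision control */
#define FPU_RC_MASK    0x0C00u  /* bits 10-11: rounding control */

/* Rounding control field: 0 = nearest, 1 = down, 2 = up, 3 = truncate. */
unsigned fpu_rounding(uint16_t cw)  { return (cw & FPU_RC_MASK) >> 10; }

/* Precision control field: 0 = 24-bit, 2 = 53-bit, 3 = 64-bit mantissa. */
unsigned fpu_precision(uint16_t cw) { return (cw & FPU_PC_MASK) >> 8; }

/* Equivalent of "or cWord, 0C00h": select truncation. */
uint16_t fpu_set_truncate(uint16_t cw) { return (uint16_t)(cw | FPU_RC_MASK); }

/* Equivalent of "and cWord, not 0C00h": select round-to-nearest. */
uint16_t fpu_set_nearest(uint16_t cw)  { return (uint16_t)(cw & ~FPU_RC_MASK); }
```

Applying the 0300h mask instead leaves the rounding field untouched and only changes the precision field, which is why the first version of the proc never actually switched to truncation.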
Posted on 2005-12-04 20:01:03 by ti_mo_n
I don't see why FloatToAscii shouldn't return a pointer to the buffer it altered... I was checking functions like itoa and they do return a pointer to a buffer.... Since the buffer contains the final result after the proc is called, I don't see the need to return the two dwords obtained through the process... Still, I could be totally wrong, and in that case please correct me ;)

I did some minor changes in the code... Now FloatToAscii seems to be working but I can't make it work with FireToCelc... Now it's always returning -17.777779 ...

.data
    _32  REAL4 32.0f
    _5o9 REAL4 0.5555555555555556f
.code
    (...)
    invoke GetDlgItemInt, hWnd, IDC_FIREN, NULL, FALSE
    invoke FireToCelc, eax
    invoke FloatToAscii, eax, addr buffer
    invoke SetDlgItemText, hWnd, IDC_CELSIUS, addr buffer
    (...)

FireToCelc proc x:DWORD
.data
    result dd ?
.code
; formula: (0.5555555555555556)*(x-32)
    movss xmm0, x
    movss xmm1, _32
    subss xmm0, xmm1 ; xmm0 = x - 32
    movss xmm1, _5o9
    mulss xmm0, xmm1 ; xmm0 = (0.5555555555555556)*(x-32)
    movss result, xmm0
    mov eax, result
    ret
FireToCelc endp

FloatToAscii proc float:DWORD, lpOut:DWORD
    LOCAL temp:DWORD, temp2:DWORD, cWord:WORD
.data
    format db "%d.%d",0
    Milion REAL4 1000000.0
.code
    ; turn to truncation mode
    fstcw cWord
    or cWord, 0C00h
    fldcw cWord

    fld float
    fist temp
    fisub temp
   
    ; turn to round-to-nearest-integer mode
    fstcw cWord
    and cWord, not 0C00h
    fldcw cWord
   
    fmul Milion
    fabs ; to avoid having numbers like -1.-486
    fistp temp2
    invoke wsprintf, lpOut, addr format, temp, temp2
    mov eax, lpOut
    ret
FloatToAscii endp
end start
Posted on 2005-12-04 23:21:03 by Lenin
Sorry, I missed the "wsprintf" line. It was late and I was sleepy :P

As for FireToCelc: you pass an integer value, while you should pass a single-precision floating-point value. There IS a way to make this function work with integers: load the scalar integer and convert it to a single-precision float with CVTSI2SS (which is part of the original SSE set, so SSE2 is not needed for it). The simplest (but definitely NOT the fastest) way to convert an integer to a float is a "fild -> fstp" pair.
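The constant -17.777779 has a simple explanation, which this C model tries to show (the function names are made up for the illustration). The 32 bits of a small integer, read as a float, form a denormal that is practically zero, so the formula always computes (0 - 32) * 5/9.

```c
#include <stdint.h>
#include <string.h>

/* Same formula as FireToCelc: (x - 32) * 5/9, in single precision. */
float to_celsius(float fahrenheit)
{
    return (fahrenheit - 32.0f) * 0.5555555555555556f;
}

/* What the buggy call does: the 32 bits of the integer are reinterpreted
   as a float (like movss on a DWORD that holds an integer). */
float celsius_from_raw_bits(int32_t x)
{
    float f;
    memcpy(&f, &x, sizeof f);   /* no conversion, just the same bits */
    return to_celsius(f);
}

/* What was intended: convert the integer's value to float first
   (the job of fild/fstp or cvtsi2ss). */
float celsius_from_value(int32_t x)
{
    return to_celsius((float)x);
}
```

`celsius_from_raw_bits` gives about -17.777779 for both 100 and 212, while `celsius_from_value(212)` gives about 100.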
Posted on 2005-12-05 06:40:39 by ti_mo_n
FireToCelc proc x:DWORD
.data
    result dd ?
.code
; formula: (0.5555555555555556)*(x-32)
    fld x
    fstp result
    movss xmm0, result
    movss xmm1, _32
    subss xmm0, xmm1 ; xmm0 = x - 32
    movss xmm1, _5o9
    mulss xmm0, xmm1 ; xmm0 = (0.5555555555555556)*(x-32)
    movss result, xmm0
    mov eax, result
    ret
FireToCelc endp


Keep getting -17.777779 ... I wonder if the error is in "movss result, xmm0", is result large enough to hold xmm0?
Posted on 2005-12-05 11:28:45 by Lenin
fild x, not fld x. you're loading an integer. The SSE part is fine - I've tested it.

movss result, xmm0 moves 32-bit floating point value located at the beginning (bits: 0-31) of the xmm register, NOT the whole xmm register.  scalar operations operate on bits 0-31 of SSE registers. they produce single result from single operands. it is 1/4 of SSE's power. Packed SSE is the true power: they produce 4 results from 8 operands (multiple data) from single instruction (hence the name:  SIMD ). scalar sse is good way to omit using the FPU. all the things in this topic can be written using scalar sse only.

packed version is: movups or movaps. it loads 128-bits (4 single precision floating point values [16 bytes]) from/(to) a xmm register to/(from) a memory location (either Aligned, or Unaligned, hence the names: movaps and movups). the aligned version requires that the memory operand is 16-byte aligned.
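As a toy illustration of the scalar/packed difference, here is a plain C model of an XMM register as four float lanes. It does not use real SSE instructions; the type and function names are invented for the sketch.

```c
/* Toy model of an XMM register: four single-precision lanes.
   lane[0] corresponds to bits 0-31. */
typedef struct { float lane[4]; } xmm_t;

/* Models "mulss dst, src": only lane 0 is computed,
   lanes 1-3 of dst are left unchanged. */
xmm_t model_mulss(xmm_t dst, xmm_t src)
{
    dst.lane[0] *= src.lane[0];
    return dst;
}

/* Models "mulps dst, src": all four lanes are computed at once. */
xmm_t model_mulps(xmm_t dst, xmm_t src)
{
    for (int i = 0; i < 4; i++)
        dst.lane[i] *= src.lane[i];
    return dst;
}

/* Models "movss [mem], xmm": stores only the low 32-bit lane. */
float model_movss_store(xmm_t src)
{
    return src.lane[0];
}
```

With dst = {1, 3, 5, 7} and src = {2, 2, 2, 2}, the scalar model yields {2, 3, 5, 7} and the packed model yields {2, 6, 10, 14}.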
Posted on 2005-12-05 12:34:42 by ti_mo_n
Oh, thanks a lot for the explanation :) May I bother you a little more? :P Now everything is running fine, but I would like to know how I would rewrite the FloatToAscii function using SSE or SSE2... I believe there's no need to use packed SSE here, is there? I would just like to know if there are opcodes to operate on integers in SSE, and if so, is there some kind of reference listing its opcodes?
Posted on 2005-12-05 13:07:28 by Lenin
To deal with integers inside XMM registers you need SSE2.

MMX instructions operate on MMX registers (which are aliases of the FPU registers) and are integer instructions. They're great for writing audio or video codecs, for example (hence the name: Multimedia Extensions).

SSE instructions operate on 32-bit (single-precision) floating-point values when using XMM registers, and on integer values when using MMX registers. They're designed for 3D functions where they can aid video cards. They were supposed to be Intel's response to AMD's 3dnow! instructions. 3dnow! instructions use 3dnow! registers (which are aliases of the FPU registers, like MMX). 3dnow! registers are 64-bit and each holds 2 single-precision (32-bit) floating-point values.

SSE2 instructions operate on both MMX registers and XMM registers and are both single-precision FP and integer instructions. Additionally, they support 'integer <-> floating point' conversion and can operate on 64-bit (double-precision) floating-point values (and they support the appropriate conversion methods). They're designed to aid advanced maths applications, like speech recognition, etc.

Of course everyone can find their own application to these instructions :)

After this short introduction ( :P ) we see that we need either 3dnow! or SSE2 :) Let's stick to SSE2 (mainly because I'm not familiar enough with AMD's architecture ^^" ).

SSE, SSE2 and MMX instructions can be freely mixed inside the application. You have to remember 2 things, though: some instructions (the MMX ones and a few SSE/SSE2 ones) require an MMX register as an operand (these are 64-bit operations) and some instructions require an XMM register as an operand (these are 128-bit instructions). The second thing is that when Intel added SSE2 they also increased the functionality of some already existing SSE and MMX instructions. So if you have an SSE2-capable processor, you may, for example, use one of the SSE integer instructions with an XMM register, but someone with a CPU supporting SSE, but not SSE2, will get an Undefined Instruction (#UD) exception, regardless of the fact that it's an SSE instruction. So you have to remember that SSE2 is not only new instructions, but also an 'upgrade' of the SSE and MMX ones. Don't forget about it if you want your app to work on a wide variety of CPUs. ...Well, to be honest, you don't have to care, because nowadays almost everyone has an Athlon XP or Pentium 4 :P But it's nice to check whether the CPU supports MMX/SSE/SSE2 or not before using an MMX/SSE/SSE2 function :) Windows XP requires a Pentium-class CPU, so it's guaranteed that a win32 app may use the CPUID instruction (you don't have to check everything from 8086 through 286, 386, 486 to Pentium :) ).
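The feature check mentioned above boils down to testing three bits of the EDX value that CPUID returns for EAX = 1. A minimal C sketch of just the decoding step follows; the function names are invented, and obtaining `edx` (by actually executing CPUID) is deliberately left out to keep the example portable.

```c
#include <stdint.h>
#include <stdbool.h>

/* Feature flags returned in EDX by CPUID with EAX = 1. */
#define CPUID_EDX_MMX  (1u << 23)
#define CPUID_EDX_SSE  (1u << 25)
#define CPUID_EDX_SSE2 (1u << 26)

bool has_mmx(uint32_t edx)  { return (edx & CPUID_EDX_MMX)  != 0; }
bool has_sse(uint32_t edx)  { return (edx & CPUID_EDX_SSE)  != 0; }
bool has_sse2(uint32_t edx) { return (edx & CPUID_EDX_SSE2) != 0; }
```

A program would run this once at startup and fall back to the FPU code path when the needed bit is clear.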

...But let's get to the point already ^^"

As I've said earlier: Scalar instructions keep their operands and results in low-order dword (bits: 0-31) of XMM registers. Packed instructions use whole XMM register to produce multiple data from multiple operands in one instruction.

To load an integer into a XMM register you use MOVD instruction. MOVD instruction is a MMX instruction, but requires SSE2 to operate on XMM registers. CVTSI2SS instruction converts a scalar integer and stores it in a XMM register. This conversion instruction actually LOADS and converts the value, so we don't need the MOVD here (I mentioned it just for you to know ;) ).

So the only thing you must add is:

CVTSI2SS xmm0, x


(and pray that your assembler supports the SSE instructions :P )

CVTSI2SS means: Convert Scalar Integer 2 Scalar Single-precision value.

That's it. After this instruction you have a ready-to-work floating-point value :)

You must admit that it's a heck of a long introduction for such a short thing to say :P

Now the FloatToAscii function:

Both the FPU and SSE each have their own control register which controls how certain operations are handled, especially precision and rounding mode. With SSE2 everything is simple: there are separate instructions for single-precision operations (SSE) and separate ones for double-precision operations. With the FPU we control the precision by properly setting bits 8 and 9 of the control word. So it simply means that when we switch to SSE we have one thing less to worry about :) Now the rounding mode: SSE operations depend on the MXCSR register. This register works much like the FPU control word, except that this one is 32-bit (the FPU's control word is... well.. a word :P ). In order to alter it, we need to store it somewhere (just like in the case of the FPU's CW), set the flags as we want them, and load it back into the CPU. We store the MXCSR with the STMXCSR instruction ("Store MXCSR") and load it back with the LDMXCSR instruction ("Load MXCSR"). After we have stored it, we alter bits 13 and 14, which control the rounding mode (we need to set them both if we need truncation and clear them both if we need round-to-nearest-integer).
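The bit arithmetic that would sit between STMXCSR and LDMXCSR can be modeled in C as follows. This is a sketch of the masking only, with invented names; it does not read or write the real MXCSR. 1F80h is the register's default value.

```c
#include <stdint.h>

#define MXCSR_DEFAULT 0x1F80u   /* default: exceptions masked, round to nearest */
#define MXCSR_RC_MASK 0x6000u   /* bits 13-14: rounding control */

/* Rounding control field: 0 = nearest, 1 = down, 2 = up, 3 = truncate. */
unsigned mxcsr_rounding(uint32_t mxcsr)
{
    return (mxcsr & MXCSR_RC_MASK) >> 13;
}

/* Set both bits: truncation (round toward zero). */
uint32_t mxcsr_set_truncate(uint32_t mxcsr)
{
    return mxcsr | MXCSR_RC_MASK;
}

/* Clear both bits: round to nearest. */
uint32_t mxcsr_set_nearest(uint32_t mxcsr)
{
    return mxcsr & ~MXCSR_RC_MASK;
}
```

Starting from the default, setting truncation gives 7F80h and clearing it again restores 1F80h.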

The algo goes as follows:

cvttss2si xmm0, float   ;xmm0 = integral part of 'float' stored as integer
movss     xmm1, float   ;xmm1 = float
cvtsi2ss  xmm0, xmm0    ;xmm0 = integral part of float stored as 'float'
subss     xmm1, xmm0    ;xmm1 = fractional part of 'float'
movss     xmm2, million ;xmm2 = multiplier
cvttss2si temp, xmm0
mulss     xmm1, xmm2    ;xmm1 = multiplied fractional part
cvttss2si temp2, xmm1
invoke wsprintf.... blah blah blah


If I didn't make any mistake, then this should do it.

one word about the conversion instructions: note the double "T" in 3 of 4 conversions used above. this additional "T" means that the conversion should be made with truncation REGARDLESS of the MXCSR register. These are very nice instructions, because they allow us to perform many different conversions without the need to alter the MXCSR in any way :)

The above code has a single small dependency (compared to 8 using the FPU code), the code is free of any redundant instructions (like setting/resetting the control word), we don't need the additional variable to store the control word or MXCSR, and ..hey! it's SSE! It's cool to code with :P :)

I hope the above works properly.

You may ask why we need 2 conversions, and how the whole algorithm works. Here comes the explanation:

cvttss2si xmm0, float
This first instruction loads the 'float' variable, converts it to an integer using truncation (regardless of MXCSR) and stores it as a scalar integer in bits 0-31 of XMM0. after this we have an integer (which is the integral part of "float", because the fractional part gets truncated) in the low-order dword of xmm0.

movss     xmm1, float
This one is straightforward: it simply moves the "float" into bits 0-31 of xmm1 (scalar single precision value)

cvtsi2ss  xmm0, xmm0
this one converts the contents of xmm0. you see: we have stored an INTEGER inside the xmm register, but we need a FLOAT to subtract it later. That's why we must convert this integer to a float.

subss     xmm1, xmm0
now we subtract one scalar single-precision value from another.

movss     xmm2, million
now we load the multiplier (1000000.0)

cvttss2si temp, xmm0
the value of xmm0 (which is scalar single-precision float) gets converted into an integer and stored in a memory operand ("temp")

mulss     xmm1, xmm2
now the fractional part gets multiplied

cvttss2si temp2, xmm1
the same is done as it was to the xmm0

Example:
'float' = 321.123

cvttss2si xmm0, float   ;xmm0 = 321
movss     xmm1, float   ;xmm1 = 321.123
cvtsi2ss  xmm0, xmm0    ;xmm0 = 321.0
subss     xmm1, xmm0    ;xmm1 = 0.123
movss     xmm2, million ;xmm2 = 1000000.0
cvttss2si temp, xmm0    ;store 321
mulss     xmm1, xmm2    ;xmm1 = 123000.0
cvttss2si temp2, xmm1   ;store 123000


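when you display it, you'll get roughly "321.123000" (single precision cannot store 321.123 exactly, so the last digits may come out slightly different)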
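The same steps can be tried out in C, where a float-to-int cast truncates just like cvttss2si. This is a model of the arithmetic under the assumption that the value fits in a signed 32-bit integer; `split_t` and `split_float` are names invented for the sketch.

```c
#include <stdint.h>

/* Result of the split: the two integers that get passed to wsprintf. */
typedef struct {
    int32_t whole;     /* integral part (temp) */
    int32_t fraction;  /* fractional part scaled by 1000000 (temp2) */
} split_t;

split_t split_float(float value)
{
    split_t r;
    float whole_f;
    float frac;

    r.whole = (int32_t)value;                   /* cvttss2si: truncate toward zero */
    whole_f = (float)r.whole;                   /* cvtsi2ss: back to float */
    frac = value - whole_f;                     /* subss: fractional part */
    r.fraction = (int32_t)(frac * 1000000.0f);  /* mulss, then cvttss2si */
    return r;
}
```

For 2.5 this gives 2 and 500000. For -1.25 it gives -1 and -250000, so a "%d.%d" format prints two minus signs, and for 321.123 the fraction comes out slightly below 123000 because of single-precision rounding.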

I think that's all.
Posted on 2005-12-05 13:55:16 by ti_mo_n
Thanks a lot for the wonderful explanation :) It'll be a reference for me from now on...

Your code didn't compile, so I had to change it a little...

FloatToAscii proc float:DWORD, lpOut:DWORD
    LOCAL temp:DWORD, temp2:DWORD
.data
    format db "%d.%d",0
    million REAL4 1000000.0
.code
    cvttss2si eax, float    ;eax = integral part of 'float' stored as integer
    movss    xmm0, float  ;xmm0 = float
    cvtsi2ss  xmm1, eax    ;xmm1 = integral part of float stored as 'float'
    subss    xmm0, xmm1    ;xmm0 = fractional part of 'float'
    movss    xmm2, million ;xmm2 = multiplier
    mov      temp, eax
    mulss    xmm0, xmm2
    cvttss2si eax, xmm0
    mov      temp2, eax
    invoke    wsprintf, lpOut, addr format, temp, temp2
    mov      eax, lpOut
    ret
FloatToAscii endp


Thank you once again :)
Posted on 2005-12-05 16:22:20 by Lenin
Oh yes - I forgot that the conversion instructions must reference the GP registers as their destinations

The source operand can be an XMM register or a 32-bit memory location. The destination operand is a general-purpose register.
Posted on 2005-12-05 16:26:19 by ti_mo_n
Hehe I can't sleep so I'm trying to improve the algo... I'm quite pleased with the function, but there are some results that are annoying me... They have zeros at the end of the decimal places, e.g. 1.990000, and negative numbers have two negative signs, e.g. -7.-149 ....

I have an idea for the first one: get the number of decimal digits (there are 4 decimal digits in 0.1234, for example), then use that number to select the member of an array that has that many zeros.... Look:

    array REAL4 10.0, 100.0, 1000.0, 10000.0, 100000.0, 1000000.0

    num = 0.1234
    len = getDecimaldigits(num) ; len = 4
    mulss xmm0, array[(len-1)*4] ; xmm0 = xmm0*10000.0
    cvttss2si ecx, xmm0 ; ecx = 1234
    (etc...)


The problem here is how to get the number of decimal digits... The only way I can think of is to convert it into a string and then strlen it... But that would be terribly inefficient...

I don't know how I would solve the second one, though. Do you know which bits store the sign of the number?

Here's the code so far...
FloatToAscii proc float:DWORD, lpOut:DWORD
.data
    format db "%d.%d",0
    million REAL4 1000000.0
.code
    cvttss2si eax, float    ;eax = integral part of 'float' stored as integer
    movss    xmm0, float  ;xmm0 = float
    cvtsi2ss  xmm1, eax    ;xmm1 = integral part of float stored as 'float'
    subss    xmm0, xmm1    ;xmm0 = fractional part of 'float'
    mulss    xmm0, million
    cvttss2si ecx, xmm0
    invoke    wsprintf, lpOut, addr format, eax, ecx
    mov      eax, lpOut
    ret
FloatToAscii endp
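On the two open questions: in a single-precision value the sign is bit 31, and the trailing zeros can be stripped from the finished string without counting digits first. Below is one possible approach, written as a C model rather than MASM. `float_to_ascii` is a hypothetical name, and the sketch assumes a finite value with magnitude below 2^31 and a buffer of at least 20 bytes.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical variant of FloatToAscii: sign handled once, leading
   zeros of the fraction kept, trailing zeros removed. */
char *float_to_ascii(float value, char *out, size_t size)
{
    uint32_t bits;
    float magnitude;
    int negative, whole, frac, len;

    memcpy(&bits, &value, sizeof bits);
    negative = (bits >> 31) != 0;            /* bit 31 is the sign bit */

    bits &= 0x7FFFFFFFu;                     /* clear the sign: |value| */
    memcpy(&magnitude, &bits, sizeof magnitude);

    whole = (int)magnitude;                  /* truncates toward zero */
    frac  = (int)((magnitude - (float)whole) * 1000000.0f);

    /* "%06d" keeps leading zeros, so 0.05 does not print as 0.50000. */
    len = snprintf(out, size, "%s%d.%06d", negative ? "-" : "", whole, frac);

    /* Drop trailing zeros but keep one digit after the point.
       Skipped if the output was truncated or snprintf failed. */
    if (len > 0 && (size_t)len < size) {
        while (out[len - 1] == '0' && out[len - 2] != '.')
            out[--len] = '\0';
    }
    return out;
}
```

With this model 1.99 prints as "1.99", -0.5 as "-0.5" and 0.05 as "0.05". The last case also shows that a plain "%d.%d" format would lose the leading zero of the fraction.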
Posted on 2005-12-06 00:56:48 by Lenin
ti_mo_n,

Why don't you save what you have written into our x86 book?  ;)

Regards,
Victor
Posted on 2005-12-06 04:36:50 by roticv