Any hint about this please?

1 = 1*10^0

213 = 2.13*10^2

bin

1100 = 1.100*10^11

The only difference between float and double is the biased exponent.

In assembly, how can I do it if registers are 32 bits long, not 64 bits?

I must give a float and it must return a double

`REAL8_REAL4 PROC num8:PTR, num4:PTR`

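mov eax, num4 ; eax = address of the REAL4 source

mov edx, num8 ; edx = address of the REAL8 destination

fld REAL4 PTR [eax] ; load the float onto the FPU stack

fstp REAL8 PTR [edx] ; store it as a double and pop the stack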

ret

REAL8_REAL4 ENDP

MyNum8 REAL8 ?

MyNum4 REAL4 3.14

invoke REAL8_REAL4, ADDR MyNum8, ADDR MyNum4

...I leave the explanation to you. :)

in assembly how can I do it if registers are 32 bits long, not 64 bits?

I must give a float and it must return a double

If you can't move a pile of dirt with one shovel, you move it one shovel load at a time.

In addition to adding zeroes to the right of your float, you will need to change the number of bits in the exponent. Without looking it up, I'm guessing there's 11 or 12 bits of exponent in the double format.
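To make the "one shovel load at a time" idea concrete, here is a minimal sketch in C (not assembly, so it only shows the bit layout). It widens the raw bits of a float into the two 32-bit halves of a double. The function name and the hi/lo output parameters are made up for illustration, and it assumes IEEE 754 single and double formats (8-bit exponent with bias 127, 11-bit exponent with bias 1023).

```c
#include <stdint.h>

/* Hypothetical helper: widen IEEE 754 single-precision bits `f` into
   double-precision bits, returned as two 32-bit halves (hi = bits 63..32,
   lo = bits 31..0), so nothing wider than 32 bits is ever needed. */
static void float_bits_to_double_bits(uint32_t f, uint32_t *hi, uint32_t *lo)
{
    uint32_t sign = f & 0x80000000u;     /* sign stays in the top bit   */
    uint32_t exp  = (f >> 23) & 0xFFu;   /* 8-bit biased exponent       */
    uint32_t man  = f & 0x007FFFFFu;     /* 23 stored mantissa bits     */

    if (exp == 0xFFu) {
        exp = 0x7FFu;                    /* infinity or NaN             */
    } else if (exp != 0) {
        exp += 1023 - 127;               /* normal number: re-bias      */
    } else if (man != 0) {
        /* Subnormal float: shift until the leading 1 reaches bit 23,
           lowering the exponent once per shift, then drop that bit. */
        exp = 1023 - 127 + 1;
        while ((man & 0x00800000u) == 0) {
            man <<= 1;
            exp--;
        }
        man &= 0x007FFFFFu;
    }
    /* else: zero, exponent and mantissa both stay 0 */

    /* 52 - 23 = 29 zero bits are appended on the right of the mantissa:
       the top 20 mantissa bits land in hi, the bottom 3 at the top of lo. */
    *hi = sign | (exp << 20) | (man >> 3);
    *lo = man << 29;
}
```

For example, 3.14 as a float is 0x4048F5C3, and the sketch yields hi = 0x40091EB8, lo = 0x60000000.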

Any hint about this please?

look up the fild instruction.

Cheers,

RandyHyde

For those of you who haven't noticed, this is a class assignment in doing fp the hard way.

yep, either that or this guy loves the fpu and wants to know it in detail

In decimal scientific notation, there is only ever one digit to the left of the decimal point; the same is true for IEEE floating point. However, in accordance with the IEEE floating point spec, in order to save a bit of storage, that leading digit (always 1 for a normalised binary number) is implied rather than stored at the top of the mantissa. Effectively, all calculations must be made as if a 24 bit mantissa were available in the case of a 32 bit float, even though only 23 bits are stored.
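A small sketch of the implied 1, in C rather than assembly. It pulls a float apart into its fields and puts the hidden bit back. The struct and function names are invented for illustration, and it assumes a normal number (not zero, subnormal, infinity or NaN) and a 32-bit IEEE 754 `float`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the decoded fields of a normal float. */
typedef struct {
    uint32_t sign;         /* 0 or 1                                    */
    int      exponent;     /* unbiased exponent (stored value minus 127) */
    uint32_t significand;  /* 24 bits: implied 1 at bit 23 + 23 stored   */
} float_fields;

static float_fields decode_float(float f)
{
    uint32_t bits;
    float_fields out;

    memcpy(&bits, &f, sizeof bits);           /* reinterpret the bytes   */
    out.sign        = bits >> 31;
    out.exponent    = (int)((bits >> 23) & 0xFFu) - 127;
    out.significand = (bits & 0x007FFFFFu) | 0x00800000u; /* restore the 1 */
    return out;
}
```

With 12.0f (binary 1100) this gives exponent 3 and significand 0xC00000, i.e. binary 1.1 with the leading 1 restored.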

In order to convert an integer to a floating point value, we must determine the index of its most significant set bit. This will be a value between 0 and 31 if the initial integer value is non-zero.

Note that as we are converting an integer value to a floating point, we will only ever need to add to the exponent (exponents less than 127 indicate fractional values less than 1). Nor can we express a NaN, or infinite value using a 32 bit integer, so we don't need to worry about these either.

IF our integer == 0
    SIGN = 0
    EXPONENT = 0
    MANTISSA = 0
ELSE
    SIGN = 0
    EXPONENT = 127 + index of most significant set bit
    MANTISSA = our integer shifted (left or right) so the most significant set bit is in bit position 23 (the 24th bit, counting from 0).
    That bit is promptly thrown away, since it is the implied 1!
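A rough C rendering of the pseudocode above, for checking the bit positions only (the assignment itself is in assembly). The helper `msb_index` is a made-up stand-in for the x86 `bsr` instruction, and the function name is hypothetical. Bits are counted from 0, so the implied 1 sits at bit 23.

```c
#include <stdint.h>

/* Stand-in for bsr: 0-based index of the most significant set bit.
   The caller must pass a non-zero value. */
static int msb_index(uint32_t x)
{
    int i = 0;
    while (x >>= 1)
        i++;
    return i;
}

/* Unsigned 32-bit integer to single-precision bits, truncating any low
   bits that do not fit in the 23-bit mantissa. */
static uint32_t u32_to_float_bits_trunc(uint32_t n)
{
    int index;
    uint32_t man;

    if (n == 0)
        return 0;                      /* sign, exponent, mantissa all 0 */

    index = msb_index(n);

    /* Move the leading 1 to bit 23: right if it is above, left if below. */
    if (index > 23)
        man = n >> (index - 23);
    else
        man = n << (23 - index);

    man &= 0x007FFFFFu;                /* throw away the implied 1       */
    return ((uint32_t)(127 + index) << 23) | man;
}
```

For 213 this returns 0x43550000, which is the bit pattern of 213.0 as a float.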

The above is of course assuming that you are dealing with an unsigned integer, and that you want to truncate the integer rather than round. Things get more complicated if you wish to deal with either case.

From now on I won't re-type the code that deals with zero; consider it implied.

SIGN = sign of integer value
temp = ABS(our integer)
index = the index of temp's most significant set bit
if (index > 23)
    temp += 1 << (index - 24) ; Here we do our rounding!
EXPONENT = 127 + index of most significant set bit of temp (recomputed, as the rounding may carry into a higher bit)
MANTISSA = temp shifted (left or right) so the most significant set bit is in bit position 23 (the 24th bit, counting from 0).
That bit is promptly thrown away, since it is the implied 1!
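The signed, rounding variant might look like this in C. Again this is only a sketch for checking the logic: the names are hypothetical and `msb_index` stands in for `bsr`. Note that it rounds halves upward, as the pseudocode does. That is not the same as the FPU's default round-to-nearest-even, so results can differ from `fild` on exact ties.

```c
#include <stdint.h>

/* Stand-in for bsr: 0-based index of the most significant set bit.
   The caller must pass a non-zero value. */
static int msb_index(uint32_t x)
{
    int i = 0;
    while (x >>= 1)
        i++;
    return i;
}

/* Signed 32-bit integer to single-precision bits, rounding halves up. */
static uint32_t i32_to_float_bits_round(int32_t n)
{
    uint32_t sign, temp, man;
    int index;

    if (n == 0)
        return 0;

    sign = (n < 0) ? 0x80000000u : 0u;
    /* ABS done in unsigned arithmetic so INT32_MIN does not overflow. */
    temp = (n < 0) ? 0u - (uint32_t)n : (uint32_t)n;

    index = msb_index(temp);
    if (index > 23)
        temp += 1u << (index - 24);    /* add half of the last kept bit  */

    index = msb_index(temp);           /* the rounding may have carried  */

    if (index > 23)
        man = temp >> (index - 23);
    else
        man = temp << (23 - index);

    man &= 0x007FFFFFu;                /* throw away the implied 1       */
    return sign | ((uint32_t)(127 + index) << 23) | man;
}
```

Since temp is at most 2^31 here, the rounding add cannot overflow 32 bits. An unsigned version fed 0xFFFFFFFF would need extra care.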

The actual code at this point is rather trivial.

Use of one of the bit scan opcodes (bsf or bsr) will be needed; bsr is the one that finds the most significant set bit.

Several cmps & jmps will be needed to branch based on conditions: the code path will be different if the integer is zero, or if the integer has a bit set higher than position 23 (you need to shift the other way).

One and will be needed to remove the implied leading bit (bit position 23, counting from 0), post shift.

In the case of the double, the above holds true, except the bias for the exponent is 1023, and its position is bits 52..62 (11 bits). Similarly, the mantissa occupies bits 0..51 (52 bits), but this means that no truncation is ever necessary, so you can optimise that section out of your function in this case (a 32 bit value will always fit in 52 bits!!!), and it will only ever be shifted in one direction. However, in the case of the double, the 52 bit mantissa does not fit within a 32 bit register, so there will be a certain amount of bit fiddling to do (it should just be shifts, ands and ors).
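As a sketch of that bit fiddling, here is the unsigned case in C with the double built as two 32-bit halves. The function name and the hi/lo outputs are made up for illustration, and `msb_index` again stands in for `bsr`. The only awkward part is that the left shift of the mantissa straddles the two words.

```c
#include <stdint.h>

/* Stand-in for bsr: 0-based index of the most significant set bit.
   The caller must pass a non-zero value. */
static int msb_index(uint32_t x)
{
    int i = 0;
    while (x >>= 1)
        i++;
    return i;
}

/* Unsigned 32-bit integer to double-precision bits, as two 32-bit halves
   (hi = bits 63..32, lo = bits 31..0). No rounding is ever needed. */
static void u32_to_double_bits(uint32_t n, uint32_t *hi, uint32_t *lo)
{
    int index, shift;
    uint32_t frac;

    if (n == 0) {
        *hi = 0;
        *lo = 0;
        return;
    }

    index = msb_index(n);
    frac  = n & ~(1u << index);        /* drop the implied leading 1     */
    shift = 52 - index;                /* left shift, always 21..52      */

    if (shift >= 32) {
        *hi = frac << (shift - 32);    /* everything lands in the top word */
        *lo = 0;
    } else {
        *hi = frac >> (32 - shift);    /* upper part of the 64-bit shift */
        *lo = frac << shift;           /* lower part                     */
    }

    *hi |= (uint32_t)(1023 + index) << 20;   /* exponent in bits 52..62  */
}
```

For 213 this gives hi = 0x406AA000 and lo = 0, and for 0xFFFFFFFF it gives hi = 0x41EFFFFF and lo = 0xFFE00000.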

Really, if you can't work out what you are meant to do from this, I'd give up and go home. The next step is to give you the answer verbatim, which would be cheating. I don't want to cheat, you don't want to cheat, nobody here wants to cheat, so that's simply not an option. Perhaps if you still don't understand, you should show this to your teacher / professor and ask him for more help.

Mirno