Hello!
I'd like an automated way to check whether certain Windows .dll and .exe files use SSE2 and other instruction sets.

If they do, a list of the instructions they use would also help.
I would also like a recommendation for a good debugger, because I have a hard time with OllyDbg 2 when it comes to SSE2, SSE3, etc...

Thank you
Posted on 2010-08-19 16:14:06 by amocsy
Almost all 64-bit Windows DLLs use SSE2. Almost none of the 32-bit Windows DLLs use any SSE.
Posted on 2010-08-19 17:54:26 by ti_mo_n
Yes, and there are two reasons why SSE2 is used a lot in 64-bit:
1) The 64-bit extensions were introduced after SSE2, so technically SSE2 is not an extension in 64-bit mode. All 64-bit processors support SSE2 by default (the usual runtime check is sketched below).
2) x87 is deprecated in 64-bit mode. All floating point operations should be done with SSE2.
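
For point 1, the usual runtime check that a program (or a fallback path) performs is a CPUID query. A minimal sketch in C, assuming MSVC's __cpuid intrinsic (GCC offers __get_cpuid in <cpuid.h> instead); the bit positions are from the CPUID leaf 1 feature flags:

#include <stdio.h>
#include <intrin.h>   /* MSVC: __cpuid intrinsic */

int main(void)
{
    int r[4];         /* r[0]=EAX, r[1]=EBX, r[2]=ECX, r[3]=EDX */

    __cpuid(r, 1);    /* leaf 1: feature flags */
    printf("SSE2:  %s\n", (r[3] & (1 << 26)) ? "yes" : "no");  /* EDX bit 26 */
    printf("SSE3:  %s\n", (r[2] & (1 << 0))  ? "yes" : "no");  /* ECX bit 0  */
    printf("SSSE3: %s\n", (r[2] & (1 << 9))  ? "yes" : "no");  /* ECX bit 9  */
    return 0;
}

Note that this only tells you what the CPU supports, not what a given binary actually contains.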

Anyway, if you want to find out whether some code uses SSE2, you'd have to at least partially disassemble the binary in order to find out where the code is (trace through the jumps and calls and such).
There are also various binaries that contain SSE2 code but also a fallback path for older CPUs... so even if you detect SSE2 instructions, it doesn't necessarily mean that code path is ever taken.
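
To give an idea of what such a scan could look like, here is a rough sketch in C, assuming the Capstone disassembly library (not something from this thread). It only does a naive linear sweep over a code buffer; a real tool would parse the PE headers to find the code sections and trace jumps/calls as described above, and it still couldn't tell you whether a detected SSE2 path is ever executed:

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>
#include <capstone/capstone.h>

/* Count SSE/SSE2/SSE3 instructions in a raw code buffer.  The caller is
   expected to have extracted the code section (e.g. .text of a PE file)
   into 'code'.  Use CS_MODE_64 for 64-bit binaries. */
int count_sse_instructions(const uint8_t *code, size_t size, uint64_t base)
{
    csh handle;
    cs_insn *insn;
    size_t count, i, hits = 0;

    if (cs_open(CS_ARCH_X86, CS_MODE_32, &handle) != CS_ERR_OK)
        return -1;
    cs_option(handle, CS_OPT_DETAIL, CS_OPT_ON);   /* needed for group info */

    count = cs_disasm(handle, code, size, base, 0, &insn);
    for (i = 0; i < count; i++) {
        if (cs_insn_group(handle, &insn[i], X86_GRP_SSE1) ||
            cs_insn_group(handle, &insn[i], X86_GRP_SSE2) ||
            cs_insn_group(handle, &insn[i], X86_GRP_SSE3)) {
            printf("0x%" PRIx64 ":\t%s\t%s\n",
                   insn[i].address, insn[i].mnemonic, insn[i].op_str);
            hits++;
        }
    }
    cs_free(insn, count);
    cs_close(&handle);
    return (int)hits;
}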
Posted on 2010-08-20 02:12:27 by Scali
It doesn't matter whether it has an alternative path or not, so let me rephrase: I'd like to check whether or not a binary contains SSE2 code.

1. But I don't think it's necessary to disassemble the binary; wouldn't knowing the (numeric) opcodes be enough?
2. Also, SSE2 uses its own registers; wouldn't it be enough to check whether a binary uses those registers or not?

Those are two possible ways it could work; maybe someone can come up with something better.

Are there no established techniques that would just tell me? It doesn't necessarily have to be a programmatic one.
I'd be happy with a virus-scanner-like approach which, instead of telling me I have a new variant of some malicious code, would tell me I have SSE2 code (like an infection :D ).

I would like to make detailed statistics about SSE2/SSE3 adoption in my environment.
Posted on 2010-08-21 15:01:26 by amocsy
Your reply shows this topic is beyond your current knowledge. I suggest you first read about the binary encoding of CPU instructions and about assembly in general.

And "a virus scanner like approach" is pretty much what Scali said. It's called Disassembly. Other approaches virus scanners use (pattern matching, heuristic analysis, etc) are not applicable in this scenario.
Posted on 2010-08-21 19:10:34 by ti_mo_n
2) x87 is deprecated in 64-bit mode.


No, that's not right. For user threads the state of legacy floating point is preserved at context switch. But it is not true for kernel threads. Therefore only kernel-mode drivers cannot use legacy floating-point instructions. Please check the following link, provided by MS:

http://msdn.microsoft.com/en-us/library/a32tsf7t(VS.80).aspx

All floating point operations should be done with SSE2.


That depends. If one needs better numerical accuracy, one should use FPU instructions, because the calculation is done internally with 80 bits, while SSE operates with 64 bits.
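
A small illustration of that difference, assuming a compiler whose long double really is the 80-bit x87 format (e.g. GCC on x86; Microsoft's compiler maps long double to plain double, and the x87 precision-control setting also has to allow 64-bit significands):

#include <stdio.h>

int main(void)
{
    /* 1e16 + 1 does not fit in a 53-bit double significand,
       but it does fit in the 64-bit x87 significand. */
    double      d  = (1e16  + 1.0 ) - 1e16;   /* prints 0.0 or 2.0 */
    long double ld = (1e16L + 1.0L) - 1e16L;  /* prints 1.0 */

    printf("double:      %.1f\n", d);
    printf("long double: %.1Lf\n", ld);
    return 0;
}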

Gunther
Posted on 2010-08-23 12:25:15 by Gunther

I think you just misunderstand the meaning of the term 'deprecated'.
Basically it means "Yea, it still works, but we don't guarantee that it will still work in the future. We suggest you no longer make use of it."
I pasted the MSDN link and a quote a while ago on this forum... can't be arsed to look for it now.

Edit: for some reason I recalled that I used the phrase "MSDN to the rescue"... and searching for that gave me the right post immediately:
http://www.asmcommunity.net/board/index.php?topic=29617.msg210597#msg210597

As for more than 64-bit precision... all CPU manufacturers decided LONG ago that it's useless (Microsoft's compiler has never supported a datatype for it either, such as an 80-bit long double).
x86 and 68k are pretty much the only architectures that offered more than 64-bit precision in hardware. If you need more than 64-bit, you're doing it wrong.
Posted on 2010-08-23 12:31:43 by Scali
If you need more than 64-bit, you're doing it wrong.
Or rather, if you need more than 64bit precision... you need something that isn't floating-point.
Posted on 2010-08-23 13:21:14 by f0dder

Or rather, if you need more than 64bit precision... you need something that isn't floating-point.


Exactly.
Posted on 2010-08-23 13:34:42 by Scali
I think you just misunderstand the meaning of the term 'deprecated'.


No.

Basically it means "Yea, it still works, but we don't guarantee that it will still work in the future. We suggest you no longer make use of it."


There has been widespread confusion about whether 64-bit Windows allows the use of the floating-point registers. But that question was cleared up some years ago in the PlanetAMD64 forum: http://www.planetamd64.com/index.php?showtopic=3458&st=100 Here is the central quotation from that thread:


I did more than that, I emailed one of M$ kernel guys:

Sent: Wednesday, May 25, 2005 8:41 PM
Subject: FPU and MMX in x64?

I have read somewhere in MSDN (perhaps it was in DDK part) that Windows x64 will not preserve state of FPU and MMX registers across context switch and that the code written to take advantage of FPU and MMX will not work. Does that still apply and if it does what is the scope? 64-bit apps, drivers, 32-bit apps or all of them?

And here is the response I got:

From: Program Manager in Visual C++ Group
Sent: Thursday, May 26, 2005 10:38 AM

It does preserve the state. It's the DDK page that has stale information, which I've requested it to be changed. Let them know that the OS does preserve state of x87 and MMX registers on context switches.

From: Software Engineer in Windows Kernel Group
Sent: Thursday, May 26, 2005 11:06 AM

For user threads the state of legacy floating point is preserved at context switch. But it is not true for kernel threads. Kernel mode drivers can not use legacy floating point instructions.


The next point:

Microsoft's compiler has never supported a datatype for it either, like long double or such


That has to do with the laziness of the compiler designers. As far as I know, there are also a lot of C/C++ compilers which don't support BCD values. I know at least two compilers with BCD and extended-float support.

On the other hand, it makes a great difference whether you're operating internally with 80 or with 64 bits (rounding errors). If in doubt, read the following text by William Kahan, "the father of floating point": http://www.cs.berkeley.edu/~wkahan/LOG10HAF.TXT

By the way, with an assembly language application, you could support BCD, float, double, long double and a lot of other funny formats.

Gunther


Posted on 2010-08-23 14:19:35 by Gunther
I think you just misunderstand the meaning of the term 'deprecated'.

No.

No offense intended, but, seriously, that's exactly what "deprecated" means.
Posted on 2010-08-23 15:06:30 by ti_mo_n
As ti_mo_n points out, you still don't seem to understand what 'deprecated' means.
Yes it works, but no, you shouldn't be using the functionality.
The confusion arises because originally Microsoft didn't want to support x87/MMX/3DNow! at all (as per AMD's and Intel's recommendations, see their x64 documentation), but they later reversed that decision... for now.

Oh god, BCD... another thing that CPU designers gave up on LONG ago because there really is no point to it. This guy sounds like he's just timewarped here from the 70s.
Posted on 2010-08-23 15:25:09 by Scali
Okay, okay, the community guru has spoken. I don't know what deprecated means. But I can read the official statements from MS kernel programmers. And those say: the FPU state is preserved during task switches. Period.

And of course, that old-fashioned BCD format (seems like it's timewarped from the 70s). So try to write a few lines of reasonable code for, let's say, bookkeeping or banking software without BCD. Furthermore, try to convert 0.1 (decimal) into the corresponding binary format and you will see why BCD has a point. But never mind.

Gunther
Posted on 2010-08-23 16:16:11 by Gunther

Okay, okay, the community guru has spoken. I don't know what deprecated means. But I can read the official statements from MS kernel programmers. And those say: the FPU state is preserved during task switches. Period.


Yes, and that behaviour is marked as 'deprecated'.

And of course, that old-fashioned BCD format (seems like it's timewarped from the 70s). So try to write a few lines of reasonable code for, let's say, bookkeeping or banking software without BCD. Furthermore, try to convert 0.1 (decimal) into the corresponding binary format and you will see why BCD has a point. But never mind.


Just because it seemed like a good idea to do that in hardware in the 70s doesn't mean that assumption is still valid today.
As with FPU precision beyond 64 bits, no modern CPU architecture supports BCD. Intel/AMD still support it, but only through slow microcode emulation.
All modern software is written in high-level languages, which provide various libraries to convert from/to decimal, optimized for performance, without any need for BCD hardware support.
THAT's why you sound like you timewarped here from the 70s.
Common problem with x86 assembly programmers... they see all these weird x86 instructions and think they are actually useful.
Nope, they're just legacy from the 70s, when x86 was designed. Most of them aren't even implemented directly in hardware, and in most cases doing it 'the compiler way' (that is, not using the esoteric instructions, but sticking to the regular, modern, optimized subset) is the fastest way.
Here's another one for ya: rep movsd? Rubbish, total rubbish!
Here's what AMD suggests for a memcpy instead:
/******************************************************************************

Copyright (c) 2001 Advanced Micro Devices, Inc.

LIMITATION OF LIABILITY:  THE MATERIALS ARE PROVIDED *AS IS* WITHOUT ANY
EXPRESS OR IMPLIED WARRANTY OF ANY KIND INCLUDING WARRANTIES OF MERCHANTABILITY,
NONINFRINGEMENT OF THIRD-PARTY INTELLECTUAL PROPERTY, OR FITNESS FOR ANY
PARTICULAR PURPOSE.  IN NO EVENT SHALL AMD OR ITS SUPPLIERS BE LIABLE FOR ANY
DAMAGES WHATSOEVER (INCLUDING, WITHOUT LIMITATION, DAMAGES FOR LOSS OF PROFITS,
BUSINESS INTERRUPTION, LOSS OF INFORMATION) ARISING OUT OF THE USE OF OR
INABILITY TO USE THE MATERIALS, EVEN IF AMD HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES.  BECAUSE SOME JURISDICTIONS PROHIBIT THE EXCLUSION OR LIMITATION
OF LIABILITY FOR CONSEQUENTIAL OR INCIDENTAL DAMAGES, THE ABOVE LIMITATION MAY
NOT APPLY TO YOU.

AMD does not assume any responsibility for any errors which may appear in the
Materials nor any responsibility to support or update the Materials.  AMD retains
the right to make changes to its test specifications at any time, without notice.

NO SUPPORT OBLIGATION: AMD is not obligated to furnish, support, or make any
further information, software, technical information, know-how, or show-how
available to you.

So that all may benefit from your experience, please report  any  problems
or  suggestions about this software to 3dsdk.support@amd.com

AMD Developer Technologies, M/S 585
Advanced Micro Devices, Inc.
5900 E. Ben White Blvd.
Austin, TX 78741
3dsdk.support@amd.com
******************************************************************************/

/*****************************************************************************
MEMCPY_AMD.CPP
******************************************************************************/

// Very optimized memcpy() routine for all AMD Athlon and Duron family.
// This code uses any of FOUR different basic copy methods, depending
// on the transfer size.
// NOTE:  Since this code uses MOVNTQ (also known as "Non-Temporal MOV" or
// "Streaming Store"), and also uses the software prefetchnta instructions,
// be sure you're running on Athlon/Duron or other recent CPU before calling!

#define TINY_BLOCK_COPY 64      // upper limit for movsd type copy
// The smallest copy uses the X86 "movsd" instruction, in an optimized
// form which is an "unrolled loop".

#define IN_CACHE_COPY 64 * 1024  // upper limit for movq/movq copy w/SW prefetch
// Next is a copy that uses the MMX registers to copy 8 bytes at a time,
// also using the "unrolled loop" optimization.  This code uses
// the software prefetch instruction to get the data into the cache.

#define UNCACHED_COPY 197 * 1024 // upper limit for movq/movntq w/SW prefetch
// For larger blocks, which will spill beyond the cache, it's faster to
// use the Streaming Store instruction MOVNTQ.  This write instruction
// bypasses the cache and writes straight to main memory.  This code also
// uses the software prefetch instruction to pre-read the data.
// USE 64 * 1024 FOR THIS VALUE IF YOU'RE ALWAYS FILLING A "CLEAN CACHE"

#define BLOCK_PREFETCH_COPY  infinity // no limit for movq/movntq w/block prefetch
#define CACHEBLOCK 80h // number of 64-byte blocks (cache lines) for block prefetch
// For the largest size blocks, a special technique called Block Prefetch
// can be used to accelerate the read operations.  Block Prefetch reads
// one address per cache line, for a series of cache lines, in a short loop.
// This is faster than using software prefetch.  The technique is great for
// getting maximum read bandwidth, especially in DDR memory systems.

// Inline assembly syntax for use with Visual C++

void * memcpy_amd(void *dest, const void *src, size_t n)
{
  __asm {

    mov   ecx, [n]           ; number of bytes to copy
    mov   edi, [dest]        ; destination
    mov   esi, [src]         ; source
    mov   ebx, ecx           ; keep a copy of count

    cld
    cmp   ecx, TINY_BLOCK_COPY
    jb    $memcpy_ic_3       ; tiny? skip mmx copy

    cmp   ecx, 32*1024       ; don't align between 32k-64k because
    jbe   $memcpy_do_align   ;  it appears to be slower
    cmp   ecx, 64*1024
    jbe   $memcpy_align_done
$memcpy_do_align:
    mov   ecx, 8             ; a trick that's faster than rep movsb...
    sub   ecx, edi           ; align destination to qword
    and   ecx, 111b          ; get the low bits
    sub   ebx, ecx           ; update copy count
    neg   ecx                ; set up to jump into the array
    add   ecx, offset $memcpy_align_done
    jmp   ecx                ; jump to array of movsb's

align 4
    movsb
    movsb
    movsb
    movsb
    movsb
    movsb
    movsb
    movsb

$memcpy_align_done:          ; destination is dword aligned
    mov   ecx, ebx           ; number of bytes left to copy
    shr   ecx, 6             ; get 64-byte block count
    jz    $memcpy_ic_2       ; finish the last few bytes

    cmp   ecx, IN_CACHE_COPY/64  ; too big 4 cache? use uncached copy
    jae   $memcpy_uc_test

// This is small block copy that uses the MMX registers to copy 8 bytes
// at a time.  It uses the "unrolled loop" optimization, and also uses
// the software prefetch instruction to get the data into the cache.
align 16
$memcpy_ic_1:                ; 64-byte block copies, in-cache copy

    prefetchnta [esi + (200*64/34+192)]   ; start reading ahead

    movq  mm0, [esi+0]       ; read 64 bits
    movq  mm1, [esi+8]
    movq  [edi+0], mm0       ; write 64 bits
    movq  [edi+8], mm1       ;    note:  the normal movq writes the
    movq  mm2, [esi+16]      ;    data to cache; a cache line will be
    movq  mm3, [esi+24]      ;    allocated as needed, to store the data
    movq  [edi+16], mm2
    movq  [edi+24], mm3
    movq  mm0, [esi+32]
    movq  mm1, [esi+40]
    movq  [edi+32], mm0
    movq  [edi+40], mm1
    movq  mm2, [esi+48]
    movq  mm3, [esi+56]
    movq  [edi+48], mm2
    movq  [edi+56], mm3

    add   esi, 64            ; update source pointer
    add   edi, 64            ; update destination pointer
    dec   ecx                ; count down
    jnz   $memcpy_ic_1       ; last 64-byte block?

$memcpy_ic_2:
    mov   ecx, ebx           ; has valid low 6 bits of the byte count
$memcpy_ic_3:
    shr   ecx, 2             ; dword count
    and   ecx, 1111b         ; only look at the "remainder" bits
    neg   ecx                ; set up to jump into the array
    add   ecx, offset $memcpy_last_few
    jmp   ecx                ; jump to array of movsd's

$memcpy_uc_test:
    cmp   ecx, UNCACHED_COPY/64  ; big enough? use block prefetch copy
    jae   $memcpy_bp_1

$memcpy_64_test:
    or    ecx, ecx           ; tail end of block prefetch will jump here
    jz    $memcpy_ic_2       ; no more 64-byte blocks left

// For larger blocks, which will spill beyond the cache, it's faster to
// use the Streaming Store instruction MOVNTQ.  This write instruction
// bypasses the cache and writes straight to main memory.  This code also
// uses the software prefetch instruction to pre-read the data.
align 16
$memcpy_uc_1:                ; 64-byte blocks, uncached copy

    prefetchnta [esi + (200*64/34+192)]   ; start reading ahead

    movq  mm0, [esi+0]       ; read 64 bits
    add   edi, 64            ; update destination pointer
    movq  mm1, [esi+8]
    add   esi, 64            ; update source pointer
    movq  mm2, [esi-48]
    movntq [edi-64], mm0     ; write 64 bits, bypassing the cache
    movq  mm0, [esi-40]      ;    note: movntq also prevents the CPU
    movntq [edi-56], mm1     ;    from READING the destination address
    movq  mm1, [esi-32]      ;    into the cache, only to be over-written
    movntq [edi-48], mm2     ;    so that also helps performance
    movq  mm2, [esi-24]
    movntq [edi-40], mm0
    movq  mm0, [esi-16]
    movntq [edi-32], mm1
    movq  mm1, [esi-8]
    movntq [edi-24], mm2
    movntq [edi-16], mm0
    dec   ecx
    movntq [edi-8], mm1
    jnz   $memcpy_uc_1       ; last 64-byte block?

    jmp   $memcpy_ic_2       ; almost done

// For the largest size blocks, a special technique called Block Prefetch
// can be used to accelerate the read operations.  Block Prefetch reads
// one address per cache line, for a series of cache lines, in a short loop.
// This is faster than using software prefetch, in this case.
// The technique is great for getting maximum read bandwidth,
// especially in DDR memory systems.
$memcpy_bp_1:                ; large blocks, block prefetch copy

    cmp   ecx, CACHEBLOCK    ; big enough to run another prefetch loop?
    jl    $memcpy_64_test    ; no, back to regular uncached copy

    mov   eax, CACHEBLOCK / 2   ; block prefetch loop, unrolled 2X
    add   esi, CACHEBLOCK * 64  ; move to the top of the block
align 16
$memcpy_bp_2:
    mov   edx, [esi-64]      ; grab one address per cache line
    mov   edx, [esi-128]     ; grab one address per cache line
    sub   esi, 128           ; go reverse order
    dec   eax                ; count down the cache lines
    jnz   $memcpy_bp_2       ; keep grabbing more lines into cache

    mov   eax, CACHEBLOCK    ; now that it's in cache, do the copy
align 16
$memcpy_bp_3:
    movq  mm0, [esi]         ; read 64 bits
    movq  mm1, [esi+8]
    movq  mm2, [esi+16]
    movq  mm3, [esi+24]
    movq  mm4, [esi+32]
    movq  mm5, [esi+40]
    movq  mm6, [esi+48]
    movq  mm7, [esi+56]
    add   esi, 64            ; update source pointer
    movntq [edi], mm0        ; write 64 bits, bypassing cache
    movntq [edi+8], mm1      ;    note: movntq also prevents the CPU
    movntq [edi+16], mm2     ;    from READING the destination address
    movntq [edi+24], mm3     ;    into the cache, only to be over-written,
    movntq [edi+32], mm4     ;    so that also helps performance
    movntq [edi+40], mm5
    movntq [edi+48], mm6
    movntq [edi+56], mm7
    add   edi, 64            ; update dest pointer

    dec   eax                ; count down

    jnz   $memcpy_bp_3       ; keep copying
    sub   ecx, CACHEBLOCK    ; update the 64-byte block count
    jmp   $memcpy_bp_1       ; keep processing chunks

// The smallest copy uses the X86 "movsd" instruction, in an optimized
// form which is an "unrolled loop".  Then it handles the last few bytes.
align 4
    movsd
    movsd                    ; perform last 1-15 dword copies
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd                    ; perform last 1-7 dword copies
    movsd
    movsd
    movsd
    movsd
    movsd
    movsd

$memcpy_last_few:            ; dword aligned from before movsd's
    mov   ecx, ebx           ; has valid low 2 bits of the byte count
    and   ecx, 11b           ; the last few cows must come home
    jz    $memcpy_final      ; no more, let's leave
    rep   movsb              ; the last 1, 2, or 3 bytes

$memcpy_final:
    emms                     ; clean up the MMX state
    sfence                   ; flush the write buffer
    mov   eax, [dest]        ; ret value = destination pointer

    }
}
Posted on 2010-08-23 16:34:49 by Scali
Here's what AMD suggests for a memcpy instead:


First of all, we're not talking about various memcpy routines here. Everyone can read about that, for example, in the AMD Software Optimization Guide and other related documents. There's no secret about that.

Yes, and that behaviour is marked as 'deprecated'.


So, have you really checked the link I've provided? Here is the MS statement again: http://msdn.microsoft.com/en-us/library/a32tsf7t(VS.80).aspx It holds not only for Visual Studio 2005, but also for Visual Studio 2008 and of course Visual Studio 2010. Is there any newer version of VS? Nothing in that statement is marked as 'deprecated', really nothing.

All modern software is written in high-level languages, which provide various libraries to convert from/to decimal, optimized for performance, without any need for BCD hardware support.


It's not a question of speed, it's a question of accuracy. Have you tried to convert 0.1 (decimal) into its binary equivalent before answering? I'll do that for you: 0.1 dec = 0.0 0011 0011 0011 0011 ... binary. It's periodic, just like 1/3 in the decimal system. The trick is: you have 32 or 64 or 128 or however many bits to store the binary value, and that stored value is not exactly 1/10 = 0.1 decimal. That effect occurs with 0.1 and its multiples; those common numbers are infinite periodic binary fractions. By summing up such numbers, a large amount of error can accumulate. With binary floating-point arithmetic, we have to live with those conversion errors. That is why approximately 85-90% of the bank software running worldwide is written in COBOL and uses BCD arithmetic to avoid these errors.
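
A tiny C sketch of that accumulation effect (the exact digits depend on the compiler, but the sum is never exactly 100):

#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    int i;

    for (i = 0; i < 1000; i++)
        sum += 0.1;                    /* 0.1 is already rounded on entry */

    printf("sum      = %.17g\n", sum); /* slightly off from 100 */
    printf("expected = %.17g\n", 100.0);
    return 0;
}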

Just because it seemed like a good idea to do that in hardware...


Sure, that would be a better way. But do you have such hardware between your CPU and the compiler?

Gunther
Posted on 2010-08-23 17:26:51 by Gunther

First of all, we're not talking about various memcpy routines here. Everyone can read about that, for example, in the AMD Software Optimization Guide and other related documents. There's no secret about that.


I'm just pointing out that although in the 70s it seemed smart to add rep movs to the hardware, today this is no longer the preferred way to perform a memcpy.
The same goes for BCD and 80-bit floating point.

So, have you really checked the link I've provided? Here is the MS statement again: http://msdn.microsoft.com/en-us/library/a32tsf7t(VS.80).aspx It holds not only for Visual Studio 2005, but also for Visual Studio 2008 and of course Visual Studio 2010. Is there any newer version of VS? Nothing in that statement is marked as 'deprecated', really nothing.


The more important question here is: Have you checked the links that *I* have provided?
I hate to repeat myself, but apparently there is no way to get through to you... so here it is again:
http://msdn.microsoft.com/en-us/library/ee418798(VS.85).aspx
The x87, MMX, and 3DNow! instruction sets are deprecated in 64-bit modes. The instruction sets are still present for backward compatibility for 32-bit mode; however, to avoid compatibility issues in the future, their use in current and future projects is discouraged.


Happy now?

It's not a question of speed, it's a question of accuracy. Have you tried to convert 0.1 (decimal) into its binary equivalent before answering? I'll do that for you: 0.1 dec = 0.0 0011 0011 0011 0011 ... binary. It's periodic, just like 1/3 in the decimal system. The trick is: you have 32 or 64 or 128 or however many bits to store the binary value, and that stored value is not exactly 1/10 = 0.1 decimal. That effect occurs with 0.1 and its multiples; those common numbers are infinite periodic binary fractions. By summing up such numbers, a large amount of error can accumulate. With binary floating-point arithmetic, we have to live with those conversion errors. That is why approximately 85-90% of the bank software running worldwide is written in COBOL and uses BCD arithmetic to avoid these errors.


Not sure what your point is here... Nobody said you should use floating point instead of BCD.
And if we're still on the subject of 80-bit precision... as you said it yourself, the problem exists with floating point, no matter how many bits you have.
You can implement BCD just fine without specific hardware support.
For example, .NET offers some fine libraries for exactly this sort of thing, as I said. Things like the Decimal datatype, or the BigInteger.
Posted on 2010-08-24 01:54:05 by Scali
I'm just pointing out that although in the 70s it seemed smart to add rep movs to the hardware, today this is no longer the preferred way to perform a memcpy.


Of course. And it seems obvious that I'm using rep movsb in my memcpy routines. Right?

The more important question here is: Have you checked the links that *I* have provided?


Yes, I did. What can I say? The information on the Microsoft website is inconsistent. Are you absolutely sure that your page is the right one? By the way, the Intel C++ compiler for x64 Windows supports long double precision and __m64 in version 9.0 and later.

Not sure what your point is here


That's surprising. Integers are not the problem. But when converting fractions from the decimal system into the binary system you have to deal with those conversion errors. It's a simple mathematical question.

And if we're still on the subject of 80-bit precision... as you said it yourself, the problem exists with floating point, no matter how many bits you have.


Yes, that was my point. You may use float or double; both are expanded inside the FPU into the 80-bit format for the computation. Only when your calculation is finished - not before - is the result rounded back to the appropriate format. And that makes a great difference compared to floating-point operations with SSE registers, which can only use 64 bits. I wrote about that point above. The internal 80-bit calculation inside the FPU reduces the error. Do you agree?

I have to ask, because you're jumping from point to point. That trick is simple, but obvious. For example, I wrote about C compilers (software!) which don't support the BCD data type, and you came up with memcpy routines. Moreover, you now write:

You can implement BCD just fine without specific hardware support. For example, .NET offers some fine libraries for exactly this sort of thing,


Fine. The underlying hardware isn't my problem. But why must we use a separate .NET library (just another black box) for such calculations? Why is that feature not supported by native C/C++ compilers? Life would be easier, because BCD math is sometimes necessary. But that's only my personal point of view; it doesn't have to be true. You know, I'm timewarped from the 70s.

All in all, I'm here to discuss questions and I don't want fruitless quarreling that's just splitting hairs. I think it's enough now. I'm not such a guru, and neither your time nor mine is endless. So please excuse my marginalia; I won't do that again with you. ;)

Gunther 
Posted on 2010-08-24 05:29:13 by Gunther

Yes, that was my point. You may use float or double; both are expanded inside the FPU into the 80-bit format for the computation. Only when your calculation is finished - not before - is the result rounded back to the appropriate format. And that makes a great difference compared to floating-point operations with SSE registers, which can only use 64 bits. I wrote about that point above. The internal 80-bit calculation inside the FPU reduces the error. Do you agree?


That depends on whether or not you have the control word set up that way.
By default it is set to double precision, so you are not using the full 80 bit FPU precision:
http://msdn.microsoft.com/en-us/library/y0ybw9fy.aspx
By default, _controlfp's precision control is set to 53 bits (_PC_53).


So unless you specifically changed the control word, I don't agree. There is no extra precision when using x87 over SSE2 by default, in a Windows application.
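
For reference, a minimal sketch of how to inspect (and, on x86, raise) that setting with MSVC's _controlfp_s; note that the precision-control field only affects x87 code, so it does nothing for SSE2 code paths and is not supported in x64 builds:

#include <stdio.h>
#include <float.h>

int main(void)
{
    unsigned int cur = 0;

    _controlfp_s(&cur, 0, 0);                 /* query only, change nothing */
    printf("precision control: %s\n",
           (cur & _MCW_PC) == _PC_64 ? "64-bit" :
           (cur & _MCW_PC) == _PC_53 ? "53-bit" : "24-bit");

    _controlfp_s(&cur, _PC_64, _MCW_PC);      /* opt in to full x87 precision */
    return 0;
}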

I have to ask, because you're jumping from point to point. That trick is simple, but obvious. For example, I wrote about C compilers (software!) which don't support the BCD data type, and you came up with memcpy routines.


I already said that you can use libraries (or implement your own). But apparently you didn't understand what I meant by "You don't need BCD support in hardware". I can add to that: "You don't need BCD support in your compiler". You can build your own library, and in most cases someone has done that for you already.
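
As a trivial illustration of doing it without any BCD hardware (or library), one common approach is scaled integers - this is just a sketch, not how any particular library does it:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t price = 1999;           /* $19.99 stored exactly, as cents */
    int64_t qty   = 3;
    int64_t total = price * qty;    /* exact: 5997 cents, no rounding error */

    printf("total: %lld.%02lld\n",
           (long long)(total / 100), (long long)(total % 100));
    return 0;
}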

Fine. The underlying hardware isn't my problem. But why must we use a separate .NET library (just another black box) for such calculations? Why is that feature not supported by native C/C++ compilers? Life would be easier, because BCD math is sometimes necessary. But that's only my personal point of view; it doesn't have to be true. You know, I'm timewarped from the 70s.


You don't *have* to use the .NET library... It's just one of many examples.
You said: "By the way, with an assembly language application, you could support BCD, float, double, long double and a lot of other funny formats".
My point was simply that you don't need assembly for most of these (BCD should NOT be implemented with the specific BCD instructions, not even if you are using assembly; just do it the HLL way), and other 'funny formats' are best avoided altogether (such as 80-bit floats).

All in all, I'm here to discuss questions and I don't want fruitless quarreling that's just splitting hairs. I think it's enough now. I'm not such a guru, and neither your time nor mine is endless. So please excuse my marginalia; I won't do that again with you. ;)


It's not my fault that you dive in head-first. You seemed to think you knew everything better. You were pretty arrogant in trying to 'correct' others and defending yourself. Look before you leap next time.

Oh, and before I forget... Microsoft's lack of long double support is not laziness; it is there for portability reasons (just like defaulting to double precision with the FPU, for example). 80-bit floats aren't going to work on non-x86 platforms like Itanium, MIPS, PowerPC, Alpha and ARM... to name but a few platforms on which Windows is or was available.
Posted on 2010-08-24 06:09:53 by Scali
Yes, I did. What can I say? The information on the Microsoft website is inconsistent. Are you absolutely sure that your page is the right one?
Deprecated doesn't mean "it doesn't work", so I see no inconsistency in the documentation (but information *is* scattered somewhat across pages). Deprecated simply means "you shouldn't be using this" - whether it's because it will eventually be removed, or because performance is worse than alternatives.

Fine. The underlying hardware isn't my problem. But why must we use a separate .NET library (just another black box) for such calculations? Why is that feature not supported by native C/C++ compilers?
Because C++ (and especially C) has traditionally come with a very small runtime, so that it could easily be ported to other platforms... stuff like threading is only being added to C++ in the upcoming C++0x... but nothing stops you from writing/using a portable or platform-optimized bignum library. And an optimized library definitely wouldn't use BCD instructions on x86 :)
Posted on 2010-08-24 14:06:52 by f0dder
Scali, Intel Itanium actually supports double-extended precision.
Posted on 2010-08-24 17:22:04 by LocoDelAssembly