When compiling a scripting language to ASM, should I only use the float64 instructions and format, or is the float128 widely supported now? It looks like instructions which take 128-bit parameters are really working with packed 32/64-bit floats. I would like to stick with natively supported formats for speed as opposed to doing conversions, since I don't necessarily need float128 if it would be slower. Guess: based on other languages, float64 is the prefered format (ie, "double" in most languages).
Posted on 2009-04-30 17:34:38 by BinaryAlgorithm
float64 is more widely used, yes

internally i think the FPU uses float80
Posted on 2009-04-30 19:38:13 by comrade
Looking at some instruction set references, the FPU is 80-bit, MMX is for 64-bit integer ops, and the XMM registers are for packed 32/64-bit floats:

FPU 80 st0 st1 st2 st3 st4 st5 st6 st7
MMX 64 mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7
SSE 128 xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7

I'm wondering though if the later instruction sets beyond SSE have support for 128-bit FP operations, but I haven't seen anything yet to suggest that it can be done with native instructions...

One question is whether floating point is faster that integer in general. For example, defaulting all numbers to being floating point in a compiler. However, the operations require more setup to push those values into the FP registers. I think maybe having register variables could help if the compiler is smart, but integer calculations are more straightforward to optimize (?)
Posted on 2009-04-30 22:14:08 by BinaryAlgorithm
1. Floating point operations are slower in general, so defaulting all numbers to FP will most likely decrease the speed of a program.
2. SSE2 introduced 64-bit FP SIMD operations and it is widely supported today.
3. AFAIK, there is no x86-type CPU with native support of 128-bit FP data type/operations.
4. My personal opinion is that if a program requires 128-bit precision then it means that this program is badly designed. There are many math tricks you can utilize to perform incredibly precise calculations even with 32-bit precision. So first try to optimize your algorithm before trying to optimize the code.
Posted on 2009-05-01 01:41:37 by ti_mo_n
SSE stands for Streaming SIMD Extensions, and SIMD stands for Single Instruction Multiple Data. These extensions were specifically designed for working with packed registers containing more than one value.

I dont know about FPU vs int perf on modern CPUs, but I do know its been getting better and better.
Posted on 2009-05-01 01:42:41 by comrade

I dont know about FPU vs int perf on modern CPUs, but I do know its been getting better and better.

In short, simple operations like add/sub are normally faster on the ALU than on the FPU... however, div and mul are usually faster in floating point.
Converting between int and float also has some overhead, so it's not always easy to blend ALU and FPU code to get the best of both worlds.
And in most cases, SSE2 is faster than the FPU, even if you don't make use of the full packed registers, but only use the scalar operations in SSE2. SSE2 also gives you fast approximations of 1/x and 1/sqrt(x), which allow you to implement very fast divisions and sqrt operations (although probably not that useful for a scripting language, as you want to offer full precision to the programmer).

Like ti_mo_n said, float128 doesn't exist on x86 (and most other common CPUs).
And I also agree with ti_mo_n that if you need more than double precision, you may want to rework your algorithm. It may also be possible that floating point is not what you want in the first place. Eg with things like encryption, you cannot tolerate any rounding errors or inaccuracies. In such cases you generally use a special library which can handle arbitrarily large integers.
Posted on 2009-05-01 05:48:53 by Scali
The XMM registers are nice. I was kind of dreading working with the stack-based FP units. I prefer directly stating the register I want to work with. I would be using it in scalar mode for 32-bit integer ops or 64-bit float ops; if you have a good reference guide with examples (other than Intel docs) would be helpful. Do pretty much all the assemblers support these instructions?

For example, if the implementation of a FOR loop has two variants, one with integers and one with FP components, there are two options for each: use eax etc, XMM integer scalar ops, and for FP use standard FP or XMM scalar FP ops. I'm unsure if using XMM for integer ops is faster or slower, but I would tend to think for FP it is faster.
Posted on 2009-05-04 23:27:31 by BinaryAlgorithm
If you're using an integer for your loop condition, then generally the ALU is the fastest way. There's some overhead in getting from integers in XMM registers to an actual status flag that you can do a conditional jump on.
With floating point, you need to go through trouble to get the status flag anyway, and in that case XMM is probably better in most cases.
Posted on 2009-05-05 09:42:38 by Scali
MMX/SSE instructions are generally faster, but what is slower is the conversion between MMX/SSE data and the standard x86 data. So, for example, in a loop, you need to load a variable into MMX register, work on it, then send it back to a standard x86 register to set the status flags. MMX/SSE are not meant to be used as general purpose instructions. They are not general purpose - their purpose is strictly defined ("multimedia extensions", "streaming simd extensions"). So it's more like: You have some small/medium sized calculation and you see that it would benefit from SIMD, so you rewrite the calculation itself to MMX/SSE. You get a code of block which, on enter, gets some data, processes it (hopefully faster than the original code), and on exit it gives the result. The whole idea of these instructions is to avoid conditional instructions. Compare standard code which adjusts the brightness of an image, where you need to check for overflows (to avoid 'burnouts'), with MMX one, where overflows are being handled by the instructions. You just repeat 2-3 MMX instructions throughout the whole image. So the image is the input data here, you process it SIMD-way, you get the processet image (output data). That's how SIMD are supposed to be used. Find places in your code that perform just brute-force processing of one data into another. And then convert it to SIMD.

SIMD speeds up the code when your algorithm can no longer be optimized. Naturally, you start by optimizing the algorithm, not the code.
Posted on 2009-05-05 17:03:02 by ti_mo_n