I have implemented a basic FOR loop using XMM registers. Suggestions are welcome. This is the 3rd revision. The code now works for negative step sizes and target numbers. Float2 and Float3 are evaluated at run time and would normally be stored in local variables. The purpose of this exercise is to test how certain constructs of a scripting language would be translated into asm.

```
// for x = float1 to float2 step float3
        movsd xmm1,             // init loop counter
forloop:
        mov eax, dword ptr      // high dword contains sign bit
        shl eax, 1              // get high (sign) bit as CF
        jc negcompare           // CF = 1 if negative stepsize
        comisd xmm1, 
        ja endloop              // xmm1 > float2 --> ZF/CF = 0
        jmp loopbody
negcompare:
        comisd xmm1, 
        jb endloop              // xmm1 < float2 --> CF = 1
loopbody:
        movsd , xmm1            // save value
        //nop                   // some loop code here
        movsd xmm1,             // restore value
        addsd xmm1, 
        jmp forloop
endloop:
        nop
```
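For comparison, the same control flow can be written out in C. This is only a sketch of the intended semantics (pick the comparison from the sign of the step, test, run the body, add the step), not a translation of the exact instructions. The function name `count_iterations` is made up for illustration, and `signbit` from `<math.h>` stands in for the `shl eax, 1` trick on the high dword.

```c
#include <math.h>

/* Hypothetical helper: runs "for x = start to target step step"
   and returns how many times the loop body executed. */
static int count_iterations(double start, double target, double step)
{
    int n = 0;
    double x = start;              /* init loop counter */

    for (;;) {
        if (signbit(step)) {       /* negative step: leave once x < target */
            if (x < target)
                break;
        } else {                   /* positive step: leave once x > target */
            if (x > target)
                break;
        }
        n++;                       /* loop body would go here */
        x += step;                 /* corresponds to the addsd */
    }
    return n;
}
```

With these definitions, `count_iterations(0.0, 1.0, 0.25)` visits 0, 0.25, 0.5, 0.75 and 1.0 and returns 5, and `count_iterations(1.0, -1.0, -0.5)` also returns 5. A step of zero never terminates when the start does not already exceed the target, which is true of the asm version as well.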
Posted on 2009-05-05 06:36:02 by BinaryAlgorithm
ouch. But hey, it's not like you're using BigNums
Posted on 2009-05-05 10:06:45 by Ultrano
ouch ^^ BinaryAlgorithm, you seem to miss the point of SIMD. Read my answer in your other topic.
Posted on 2009-05-05 17:06:24 by ti_mo_n
I think I understand what you were saying. After looking at the instruction set more, it was obvious that doing floating point operations in parallel was the main reason for the XMM registers. I haven't profiled the speed yet, but based on the documentation I have, the XMM scalar FP operations are faster than standard FP operations. Also, they are directly referenced instead of being stack based (I don't have much experience with the FP units, but I think it's more intuitive). I doubt that using it for scalar integer ops would be effective except perhaps for division, but I think the difference is marginal. I think it is a faster and easier replacement for the older FP logic (correct me if I'm wrong).

I found an instruction that affects EFLAGS directly, like a normal comparison: the 'comisd' instruction. This is nice because it avoids having to save to memory, load to a register, and compare.
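To make the flag behaviour concrete, here is a small C model of what `comisd a, b` does to ZF, PF and CF, and of how `ja` and `jb` read those flags. The flag values follow the documented COMISD results (greater: all clear; less: CF set; equal: ZF set; unordered: all three set). The names `comisd_model`, `ja_taken` and `jb_taken` are hypothetical and exist only for this sketch.

```c
#include <math.h>
#include <stdbool.h>

/* The three flags COMISD writes (it clears OF, SF and AF). */
struct flags { bool zf, pf, cf; };

/* Hypothetical model of "comisd a, b". */
static struct flags comisd_model(double a, double b)
{
    if (isnan(a) || isnan(b))
        return (struct flags){ true, true, true };    /* unordered */
    if (a > b)
        return (struct flags){ false, false, false }; /* greater than */
    if (a < b)
        return (struct flags){ false, false, true };  /* less than */
    return (struct flags){ true, false, false };      /* equal */
}

/* ja jumps when CF = 0 and ZF = 0; jb jumps when CF = 1. */
static bool ja_taken(struct flags f) { return !f.cf && !f.zf; }
static bool jb_taken(struct flags f) { return f.cf; }
```

One consequence worth noting: because an unordered compare sets CF, a NaN operand never takes a `ja` exit but always takes a `jb` exit.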

The loop takes about 9.5 cycles, where a similar loop based on eax takes 6. It should be noted that using XMM registers instead of memory operands only improves it to 9 cycles, so there is really no penalty for using nearby cached data. Does this argument for using XMM, at least for FP, make sense?
Posted on 2009-05-05 18:47:08 by BinaryAlgorithm

I haven't profiled the speed yet, but based on the documentation I have, the XMM scalar FP operations are faster than standard FP operations. Also, they are directly referenced instead of being stack based (I don't have much experience with the FP units, but I think it's more intuitive). I doubt that using it for scalar integer ops would be effective except perhaps for division, but I think the difference is marginal. I think it is a faster and easier replacement for the older FP logic (correct me if I'm wrong).

In fact, one of the main reasons for adding the scalar functions to SSE2 was to phase out the x87 FPU.
SSE2 was first introduced with the Pentium 4. The Pentium 4 is designed without an actual x87 FPU. Instead, both x87 and SSE use the same SIMD units, and the x87 code is 'emulated' with the SIMD instructions. This is why the Pentium 4 in particular has very poor x87 performance. Athlons and Core2/Core i7 have better x87 performance, so the difference with SSE2 is not that large, but SSE2 is still generally as good or better.

If you use Pentium 4 optimization in Visual C++, you'll see that it will actually use SSE2 code for nearly all floating point operations, and even transcendental functions etc in the C library will mostly use SSE2-code.

In fact, 64-bit Windows treats x87 as legacy: the x64 calling convention passes and returns floating point values in XMM registers, and floating point code is expected to be SSE2. So 64-bit compilers will generate SSE2 code for essentially any floating point stuff (since any 64-bit processor will have SSE2 or better, this is not a problem).
Posted on 2009-05-06 02:25:06 by Scali