I wrote a new tutorial on masmforum. The tutorial covers three things. It compares a routine written in C and the same routine written in assembler, to see how you can convert the routine to assembler and make it faster than the C code. It analyzes the assembler code the VC++ 6.0 compiler generates for the C routine, and shows you why VC++ 6.0 will never outperform a good assembler optimizer. And third, it covers scalar SSE as a replacement for floating point. Scalar SSE operates on one floating point value at a time, just like floating point, and most of the opcodes for floating point are also in scalar SSE. It also shows you how to do a scalar compare that sets EFLAGS so you can do a conditional jump. You can read it here:

http://www.masmforum.com/simple/index.php?topic=1140.0
Posted on 2005-03-21 21:39:10 by mark_larson
Interesting - a few points, though. You don't list the switches you used when compiling. And why are you benching against age-old vc6, when the vc2003 toolkit is available for free? :)
Posted on 2005-03-24 09:22:51 by f0dder

Interesting - a few points, though. You don't list the switches you used when compiling. And why are you benching against age-old vc6, when the vc2003 toolkit is available for free? :)



I realized I forgot the switches after I posted the article. I used the standard default switches.

I have VC++ 6.0 installed, and installing vc2003 would probably have messed up my VC++ 6.0 install. I didn't want to take the chance, since it took me so long to get VC++ 6.0 installed: I had to install the CDs, install the MSDN, install SP5, and install PP5 (the processor pack), so it's a big pain. I have not seen reports on the web of big performance gains going from VC++ 6.0 to vc2003, so it's not that big a deal to me.
Posted on 2005-03-24 23:08:47 by mark_larson
I would tend to agree with that; the /G7 switch for the VCTOOLKIT version of CL does not seem to generate faster code than the /G6 switch from the VC6 version of CL. For whatever reason, I still get slightly smaller code from VC6 than from the VCTOOLKIT.

One other point: if you want to use the assembler output as a source to optimise the VC code, turn all of the optimisation off, and you get enough registers to work with if you know what you are doing. The optimised output generally removes the stack frame and uses all of the registers, and it is almost impossible to modify it to get its speed up. With unoptimised output you have to remove a lot of redundant loads and stores, but you can properly use and reuse registers when you have enough to start with.
Posted on 2005-03-24 23:44:09 by hutch--

I realized I forgot the switches after I posted the article.  I used the standard default switches.

...Which means you're comparing un-optimized C code with optimized assembly. This probably explains why the generated assembly has so many memory references.


Installing vc2003 would have probably messed up my VC++ 6.0 install.

Not if you install the (free) toolkit, just install to a separate location. It does generate better code, especially when you have larger projects and use "full program optimization" and "link-time code generation". Also, it has much better C++ support, which probably won't matter too much to you, though :)
Posted on 2005-03-25 07:31:30 by f0dder


I realized I forgot the switches after I posted the article. I used the standard default switches.

...Which means you're comparing un-optimized C code with optimized assembly. This probably explains why the generated assembly has so many memory references.


It's optimized. The standard VC++ optimization switch for a release build is -O2, which means I am comparing optimized C code to assembly. You need to do a debug build to turn off all optimizations, and then look at the code; it's a lot worse. That's why I was so upset with the number of memory accesses it does with -O2 optimization. I usually use __fastcall and __forceinline as well to get speedups, but when I did my testing I had cut and pasted the procedure directly inside the loop. For inline assembler, if you do the whole procedure in assembler you can use "__declspec(naked)", which gets rid of the prologue and epilogue code that VC++ generates. That'll give you a speedup too for inline assembler code.

Posted on 2005-03-25 11:25:34 by mark_larson

It's optimized.  Standard VC++ optimization switches are -O2, which means I am comparing optimized C code to assembly.

You mean, _your_ standard, like your CL environment variable?  IIRC, MSC has never set optimization level without being told explicitly.

Anyhow, the poor machine code generation seems to stem from the FPU stack. I have yet to see an intelligent x86 C compiler that utilizes all 8 FPU registers fully. (The free icc binary for Linux does that in some limited ways.)

Looking at the code, it seems that direct translation to FPU code may not be much slower - though I have not tried. One bottleneck might be fsqrt, which may take too long to give us unnecessarily high precision for the purpose of the function.

Another idea:  How about using movups?  (I suspect that must have been considered already, but it is not explicitly mentioned.)  One mulps might be better than 3 mulss, don't you think?
Posted on 2005-03-25 22:02:27 by Starless


It's optimized. Standard VC++ optimization switches are -O2, which means I am comparing optimized C code to assembly.

You mean, _your_ standard, like your CL environment variable? IIRC, MSC has never set optimization level without being told explicitly.


VC++ defaults to -O2 for the optimization switches in release builds. I double checked to make sure it had -O2 for the release build. I also looked at the code generated for a Debug build, and it was a lot worse than the -O2 release build version, since optimizations are turned off for a debug build.



Anyhow, the poor machine code generation seems to stem from the FPU stack. I have yet to see an intelligent x86 C compiler that utilizes all 8 FPU registers fully. (The free icc binary for Linux does that in some limited ways.)

Looking at the code, it seems that direct translation to FPU code may not be much slower - though I have not tried. One bottleneck might be fsqrt, which may take too long to give us unnecessarily high precision for the purpose of the function.


I also haven't seen great FP C code converted to FP assembler code by any C compiler either. That is why I did not play heavily with the optimization switches: more than likely I would have seen no gain. I think converting the code I wrote to FP code would still be a lot faster than the compiler's output. The reason is all those dang memory accesses in the compiler-generated FP assembler code.

As far as the square root, I stuck with a single precision square root. For both scalar SSE and FP it takes 27 cycles to compute a floating point square root on my P4. That's about twice as fast as on a P3. Since I am optimizing for my system, I saw no reason to do anything to the square root.



Another idea: How about using movups? (I suspect that must have been considered already, but it is not explicitly mentioned.) One mulps might be better than 3 mulss, don't you think?



If you've ever done a lot of SSE/SSE2 programming you avoid MOVUPS/MOVDQU like the plague. They take almost twice as long to execute as their aligned counterparts, and it's just easier to align your data on a 16 byte boundary and use MOVAPS/MOVDQA. The scalar SSE code operates on one ray and one sphere at a time. I did a packed SSE version that operates on 4 rays and 1 sphere at a time. It runs at 25 cycles per ray-sphere intersection (100 cycles total time, but 25 cycles per ray-sphere intersection, because you do 4 rays at a time).

I did a version using MULPS, doing most of the other multiplies in parallel, and it runs a lot slower than the scalar SSE version. I'll have to dig up the timings. If I remember right it was about 20 cycles faster than the C FP code, whereas my scalar SSE code is twice as fast.
Posted on 2005-03-26 20:19:10 by mark_larson