Ok not performance-wise but...

I've created a software renderer that supports ps 2.0 shaders. Unlike the reference rasterizer, it is not an interpreter but an optimized JIT compiler that uses SIMD instructions. As a result it is dozens of times faster than refrast, and it beats hardware in flexibility. For one, it has no problem going beyond the limit of 32 texture instructions in ps 2.0 hardware while still keeping things real-time.

You can read the details here: swShader.
Posted on 2003-05-17 15:54:17 by C0D1F1ED
Looking :)
Posted on 2003-05-17 22:57:00 by Homer
woah impressive stuff,
Posted on 2003-05-20 14:48:58 by x86asm

No support for Athlon x87 FPU/3DNow! ?
Posted on 2003-05-20 14:50:14 by x86asm
Originally posted by x86asm No support for Athlon x87 FPU/3DNow! ?

3DNow! is far inferior to SSE, but feel free to make a 3DNow! implementation; the whole source code is in the package...
Posted on 2003-05-20 18:11:35 by C0D1F1ED


awww...3DNow! isn't that bad. I know it's missing a packed divide instruction (one thing I think is a serious defect in 3DNow!) but 3DNow! can speed up most 3D calculations.
Posted on 2003-05-20 20:01:48 by x86asm
A good FPU version would be better than 3DNow, imho.
Posted on 2003-05-20 20:31:33 by bitRAKE
I need integer and floating-point calculations at the same time. With 3DNow!, the floating-point work has to share the eight MMX registers with the integer work. Add to this the fact that 3DNow! can only store two floats in an MMX register, while I mostly use four, and you end up with far too few registers. My bilinear texture sampler already uses all eight MMX registers. It's not a hard limit though, because I implemented an automatic register allocator. So you certainly can do it with this few registers, but it will produce lots of spill code, that is, extra move instructions that free registers by copying their contents to memory. So it's going to be a lot slower.
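
To make the register pressure argument concrete, here is a minimal sketch of a toy model. It is not swShader's register allocator: the function names, the least-recently-used eviction policy, and the access pattern are all assumptions made for illustration. It counts how often a value has to be pushed out of a register when a four-float vector fits in one register, versus when it has to be split over two 64-bit registers as with 3DNow!.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model, not the allocator from the package.
// 'uses' is the order in which values are needed; 'numRegs' is the number of
// registers available. Returns how many times a resident value had to be
// evicted (spilled) to make room, using a least-recently-used policy.
int countEvictions(const std::vector<int>& uses, std::size_t numRegs)
{
    assert(numRegs > 0);
    std::vector<int> regs;  // resident values, least recently used first
    int evictions = 0;
    for (int v : uses) {
        auto it = std::find(regs.begin(), regs.end(), v);
        if (it != regs.end()) {
            regs.erase(it);            // already resident: just refresh it
        } else if (regs.size() == numRegs) {
            regs.erase(regs.begin());  // no free register: evict the oldest
            ++evictions;
        }
        regs.push_back(v);             // v is now the most recently used
    }
    return evictions;
}

// With two floats per register, four-float vector v occupies two registers.
// Model that by turning each use of v into uses of its halves 2v and 2v+1.
std::vector<int> splitIntoHalves(const std::vector<int>& uses)
{
    std::vector<int> out;
    for (int v : uses) {
        out.push_back(2 * v);
        out.push_back(2 * v + 1);
    }
    return out;
}
```

For an assumed pattern that cycles twice through five vectors ({0,1,2,3,4,0,1,2,3,4}), countEvictions with eight registers gives 0 when each vector fits in one register and 12 after splitIntoHalves. A real allocator would pick its victims more cleverly, but the extra moves come from the same shortage.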

The FPU is even less of an option because it can't share registers with MMX, since it's stack based. This means an extra instruction (EMMS) is needed to switch modes, and you first have to write all registers that contain useful data to memory. I didn't implement an automatic register allocator for the FPU; I'm not even sure it's possible. It can certainly be done if you manage the registers yourself, though. Again you're going to need lots of spill code, and since the FPU only operates on single scalars you don't get the advantages of SIMD.

With SSE, the whole situation changes. MMX has its own eight registers and so does SSE. So you don't have to balance register usage any more and it's all much less complex. As I mentioned before, SSE registers can hold four floats and process them all in parallel. So you need far fewer instructions, and because the actual number of bytes in registers has tripled (8 x 8 bytes of MMX plus 8 x 16 bytes of SSE) you need a lot less spilling (hardly any).
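
As a rough illustration of the "four floats in parallel" point, here is a small sketch using the standard SSE intrinsics from xmmintrin.h. It is not code from the package, and the function names are made up for this example. It computes a multiply-add on a four-component value (the kind of work the ps 2.0 mad instruction describes) with one multiply and one add, and it also checks the register byte count at compile time. A scalar fallback is included so the sketch still builds on targets without SSE.

```cpp
#include <cassert>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#else
#define SKETCH_HAS_SSE 0
#endif

// Total bytes held by 'count' registers of 'bytesEach' bytes.
constexpr int registerFileBytes(int count, int bytesEach)
{
    return count * bytesEach;
}

// Eight 64-bit MMX registers plus eight 128-bit SSE registers hold three
// times as many bytes as the MMX registers alone.
static_assert(registerFileBytes(8, 8) + registerFileBytes(8, 16)
                  == 3 * registerFileBytes(8, 8),
              "MMX plus SSE triples the register bytes");

// out[i] = a[i] * b[i] + c[i] for all four components.
void mad4(const float a[4], const float b[4], const float c[4], float out[4])
{
#if SKETCH_HAS_SSE
    __m128 va = _mm_loadu_ps(a);   // one register holds all four floats
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c);
    _mm_storeu_ps(out, _mm_add_ps(_mm_mul_ps(va, vb), vc));
#else
    for (int i = 0; i < 4; ++i)    // scalar fallback: four separate steps
        out[i] = a[i] * b[i] + c[i];
#endif
}

// Small self-check with values that are exact in binary floating point.
void demoMad4()
{
    const float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float b[4] = {2.0f, 2.0f, 2.0f, 2.0f};
    const float c[4] = {0.5f, 0.5f, 0.5f, 0.5f};
    float out[4];
    mad4(a, b, c, out);
    assert(out[0] == 2.5f && out[3] == 8.5f);
}
```

With two floats per register, the same operation would take two multiplies and two adds and tie up twice as many registers per operand.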

Feel free to prove me wrong though. I'd really love to see a 3DNow! implementation. It's all in the Shader/PS20Assember.cpp file...
Posted on 2003-05-21 08:17:49 by C0D1F1ED