The implemented proc adds an array of floats: Destination += Source. On its first run it selects whether to use SSE or the FPU from then on.
It is called like this:

Scall Float_AddArrays,pDestination,pSource,NumFloats

; same as:
push NumFloats
push pSource
push pDestination
call Float_AddArrays

I implemented a check for whether the source and destination are 16-byte aligned, and then one of 5 algorithms does the work; 4 of them have 3 sections: large blocks, medium blocks and small blocks.
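For reference, the idea of the aligned SSE path with a scalar fallback can be sketched in C with intrinsics - this is only an illustration, not the actual MASM proc, and the function name is made up:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Sketch of Destination += Source: use the SSE path when both
   pointers are 16-byte aligned, plain scalar adds otherwise. */
void float_add_arrays(float *dest, const float *src, size_t n)
{
    size_t i = 0;
    if ((((uintptr_t)dest | (uintptr_t)src) & 15) == 0) {
        for (; i + 4 <= n; i += 4) {
            __m128 d = _mm_load_ps(dest + i);   /* aligned 4-float load */
            __m128 s = _mm_load_ps(src + i);
            _mm_store_ps(dest + i, _mm_add_ps(d, s));
        }
    }
    for (; i < n; ++i)                          /* leftovers, or fully unaligned case */
        dest[i] += src[i];
}
```

The real proc, of course, splits this further into large/medium/small block sections.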

Having finished the code, I hurried to see the results. And they surprised me a lot - SSE was slower, much slower than the FPU!
On my AthlonXP2000+, DDR400, adding two arrays of 1500 floats each, repeated 100 times, took:
SSE = 1445614 cycles => 9.637 cycles/result
FPU = 552925 cycles => 3.68 cycles/result

The 2 arrays are 16-byte aligned. I did tests on unaligned arrays, where SSE performed even worse (though only by 0.1 - 1.0 cycles/result).

Please check whether I've optimized the SSE code incorrectly - though I really doubt it. I'm more inclined to think that AMD really has boosted the FPU a lot ^_^

Please test the code on different machines, and give feedback ^^

Just a note - I came up with/remembered a 6th algorithm: when esi and edi are 4-byte aligned and (esi&15)==(edi&15), we can first add 1, 2 or 3 floats, then process the rest - which turns out to be 16-byte aligned ^^ . I've implemented it and updated the .zip
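In C, that 6th algorithm (peeling off the first 1-3 floats when both pointers share the same offset within a 16-byte block) could look roughly like this - a sketch with made-up names, assuming 4-byte-aligned inputs with equal misalignment:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>

/* dest += src, assuming both are 4-byte aligned and
   ((uintptr_t)dest & 15) == ((uintptr_t)src & 15). */
void float_add_arrays_peel(float *dest, const float *src, size_t n)
{
    size_t i = 0;
    size_t mis = (uintptr_t)dest & 15;      /* 0, 4, 8 or 12 bytes */
    if (mis) {
        size_t peel = (16 - mis) / 4;       /* 1..3 leading floats */
        for (; i < peel && i < n; ++i)
            dest[i] += src[i];
    }
    /* from here on, dest+i and src+i are both 16-byte aligned */
    for (; i + 4 <= n; i += 4)
        _mm_store_ps(dest + i,
                     _mm_add_ps(_mm_load_ps(dest + i), _mm_load_ps(src + i)));
    for (; i < n; ++i)                      /* trailing floats */
        dest[i] += src[i];
}
```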
Posted on 2005-11-04 21:34:19 by Ultrano

I downloaded your little prog onto my PC...

and it doesn't work... (I have WinXP 64-bit)

About FPU vs SSE...

it seems to me that you are right, because the FPU uses a little program stored in the processor's memory...

so on the same processor... the FPU has a better chance of being quicker...

Have a look at your prog all the same...

Enjoy..

Posted on 2005-11-04 23:14:10 by gerard
I forgot to mention that you need VKDebug, included in MASM7, and that you need to run this proc in the same partition as your MASM installation. If the debugging lib doesn't find dbgwin.exe, it shows nothing.
Posted on 2005-11-05 05:34:07 by Ultrano

I forgot to mention that you need VKDebug, included in MASM7, and that you need to run this proc in the same partition as your MASM installation. If the debugging lib doesn't find dbgwin.exe, it shows nothing.

Can't you fix up a version that doesn't require this? I'm not very inclined to install MASM32, but I'd like to test your code on my dualcore :)
Posted on 2005-11-05 09:19:26 by f0dder
Yes, ready - I've updated the attachment in the first post.
Hmm, maybe the load on my cpu is lower now, so I get better results:
SSE took 481955 cycles ; 3.21 cycles/result
FPU took 378402 cycles ; 2.52 cycles/result

Maybe because I was running three servers and two IMs before...

I remembered - I set up the data to be 16-byte aligned. With unaligned data, the SSE result is around 700000 cycles.
Posted on 2005-11-05 09:34:30 by Ultrano

SSE took 366015 cycles,
FPU took 375246 cycles

AMD64 X2 4400+, each core at 2.21GHz with 1MB of L2 cache per core.
Posted on 2005-11-05 09:42:57 by f0dder
I've now finally put in good benchmarking code. Sorry for the many updates ^^" .
So, the final, best results on my PC are:

SSE: 4492 cycles = 2.99 cycles/result
FPU: 2870 cycles = 1.91 cycles/result

These are pretty constant, with at most 0.5% error.

I don't like benchmarks that take only the best results, but these have to be done too. Anyway, I now really fancy the FPU over SSE :) .
Posted on 2005-11-05 10:16:44 by Ultrano

SSE took 3443 cycles = 2.295333 cycles/result,
FPU took 3032 cycles = 2.021333 cycles/result. Probably run on too little data :) (not to mention that AMD doesn't have as good an SSE/2/3 implementation as Intel, especially on pre-AMD64 CPUs).
Posted on 2005-11-05 10:25:57 by f0dder
Well, I need this proc to mix the audio data (2kB max) of 50-700 "virtual cables", while at least 1 of the 2 streams for each mix is in the d-cache. Yes, AMD's SSE isn't as good as Intel's, but fortunately I mostly target AMD cpus :) . And having seen that Dest+=Src takes 1.91 cycles... I find it awesome ^^ .
Posted on 2005-11-05 11:11:20 by Ultrano
Hi Ultrano
Results using Intel PIII 500MHz

SSE took 8760 cycles = 5.84 cycles/result,
FPU took 4107 cycles = 2.738 cycles/result

Posted on 2005-11-05 11:19:39 by Biterider
SSE is slower? That's not what I get in my bench:

FPU : 151
XMM : 56
XMML : 44
3DNOW : 71

I guess your implementation isn't caching properly.
Posted on 2005-11-05 12:30:50 by Eduardo Schardong
ZZZ_Test2 on an AMD Duron 1.6 gave this:

SSE took 4474 cycles = 2.982667 cycles/result
FPU took 2870 cycles = 1.913333 cycles/result
Posted on 2005-11-05 13:02:17 by Kecol

FPU : 149
XMM : 74
XMML : 54
3DNOW : 86

...and with affinity set so it only gets scheduled to CPU-1:

FPU : 168
XMM : 42
XMML : 52
3DNOW : 72

Posted on 2005-11-05 13:15:17 by f0dder
Pentium 4 2.4 GHz

SSE took 7992 cycles = 5.328 cycles/result,
FPU took 5112 cycles = 3.408 cycles/result
Posted on 2005-11-05 14:07:28 by ti_mo_n

AthlonXP2000+ , 1.67GHz
FPU : 154
XMM : 56
XMML : 44
3DNOW : 72

Good grief, Eduardo - comparing SSE code that handles only 16-byte aligned data against a poorly-written fpu loop ^^" . And the results of zzz_test2 (mine) and zzz_test3 (yours) are incomparable, since yours has bonuses:
- the process priority is realtime (weakest bonus)
- it subtracts the cycles taken for 10M empty loops after getting the result (strongest argument)
- it is tested on only 32 bytes; in other words, most of the write-back cycles are ignored (medium-strong argument)
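The empty-loop subtraction mentioned above is a standard trick; in C it could be sketched like this (rdtsc via compiler intrinsic; the helper names are made up and this is not the thread's benchmark code):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Subtract the measured cost of an empty loop from a measured total,
   clamping at zero so timing noise can't produce a negative count. */
uint64_t subtract_overhead(uint64_t total, uint64_t empty)
{
    return total > empty ? total - empty : 0;
}

/* Measure one run of fn() in TSC cycles. A real benchmark would pin
   the thread to one core, since the TSC is per-core. */
uint64_t measure(void (*fn)(void))
{
    uint64_t t0 = __rdtsc();
    fn();
    return __rdtsc() - t0;
}
```

Whether subtracting the empty-loop cost is fair is exactly the point under debate here: it hides loop overhead that real callers would still pay.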

But the differences between XMMLoop and XMMSingle, in both results and code, are interesting - nice learning material for me, thanks ^^ . I'll have to burn/delete some optimization tutorials that stated that reading/writing memory backwards was a bit slower. But only after I thoroughly test it :)
Posted on 2005-11-05 14:25:33 by Ultrano
Hmm.. under my conditions (zzz_test2), your SSE code does the job in 1.04 cycles/result :)
Interesting how here we shouldn't preload registers, and using only 1 register can do the trick.. And reading+writing memory backwards is as fast as, or faster than, forward..
Posted on 2005-11-05 14:54:11 by Ultrano
When you showed the results of your prog, I was surprised by the SSE and FPU times, so I made that test to see if SSE is really slower. Both the FPU and XMML procs run with a simple loop (not optimized) at HIGH_PRIORITY_CLASS, and SSE was fast. I see two reasons for that:
1) SSE does 4 float adds at once: 4 times fewer memory reads, 4 times fewer memory writes.
2) The FPU always does 64-bit-precision operations, against the 24 bits of SSE single precision.

I guess the difference between XMML and XMMSingle occurs because of caching (on f0dder's X2 with 1MB cache, XMMSingle was faster): the 24 SSE instructions of XMMSingle take 93 bytes (108 bytes for the proc), while the loop of XMML takes 20 bytes (41 bytes for the proc).

I made another try; I changed the loop of XMML to:

    sub eax, 4
    align 16
@@:
    movaps xmm0, [edx+eax*4]   ; note: the bracketed operands were lost
    addps xmm0, [ecx+eax*4]    ; in the post; these are a plausible
    movaps [edx+eax*4], xmm0   ; reconstruction of the load/add/store
    sub al, 4
    jnz @B

Now the loop takes only 16 bytes; even with an extra instruction (the sub eax, 4 at the top), the proc was 1 cycle faster (it can gain more on bigger arrays).

If you reduce the size of your proc, maybe the SSE version runs faster.
Posted on 2005-11-05 15:04:34 by Eduardo Schardong
My guess as to why the xmm loop was so fast (compared to the XMMSingle-based procs):
(cycles are per 4 results; in total, 8 cycles are saved per 4 floats)
- align 16 - 1 cycle
- small loop - 1..4 cycles. The cpu keeps the CISC instructions converted to RISC ops, so there's no need for extra conversion on each loop iteration (or every 8..32 instructions)
- not incrementing ESI and EDI - 2 cycles
- using ECX as the base index - strangely, this is quicker than I thought.

What I mostly wonder about is why one addressing form of
movaps xmm0, [mem]
seems 0.25 cycles faster than the other on my cpu. o_O . I guess it's opcode size and alignment again.

Hmm :) now, fortunately proven wrong, I can continue studying SSE, to make a useful lib for my DSP stuff ^^.

Btw, I've noticed that with the latest optimizations, most speedups are explained with "I guess..." ^^" . I'll have to do it the Russian way - "experiment with brute force until you achieve the wanted result" :) . Of course, having been away from optimizing x86 code for a year, I might be missing some new tutorials/documentation ^^" . Also, maybe ARM-cpu optimization has become etched into my mind :|

Again, thanks, Eduardo :) .
Posted on 2005-11-05 16:07:58 by Ultrano
Try changing the FPU precision control flags... and see if it has any effect.
Posted on 2005-11-05 20:40:13 by f0dder
I tried it, f0dder; the only difference is that if I don't set the PM bit, I get a precision exception. According to the AMD manual:

x87 instructions carry out all computations using the 80-bit double-extended-precision format. When an x87 instruction reads a number from memory in 80-bit double-extended-precision format, the number can be used directly in computations, without conversion. When an x87 instruction reads a number in a format other than double-extended-precision format, the processor first converts the number into double-extended-precision format. The processor can convert numbers back to specific formats, or leave them in double-extended-precision format when writing them to memory.

The precision you choose won't make a difference in the timing (it can only generate a precision exception), so always use 64-bit precision :)
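The precision gap itself is easy to see from C (a small illustration, not from the thread's code): adding a tiny value is lost in 24-bit single precision, since it falls below half an ulp of 1.0f, but it survives in the wider formats the x87 works in internally:

```c
/* Single precision has a ~1.19e-7 epsilon at 1.0, so adding 1e-8
   vanishes; long double (64-bit mantissa on x87, or at least
   double) keeps it. */
float add_single(float a, float b)
{
    return a + b;               /* result rounded to 24-bit mantissa */
}

long double add_extended(long double a, long double b)
{
    return a + b;               /* result kept in wider precision */
}
```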
Posted on 2005-11-06 10:57:21 by Eduardo Schardong