Hey guys, look what I found out:
I wrote a function to calculate the dot product of two vectors, plus a small benchmark program with one version for the FPU and one for 3DNow! (no SSE, because my Thunderbird doesn't support it).
Anyway, here are my results:
AMD Thunderbird 900 MHz, Socket A, 133 MHz SDRAM

3DNow! dot products per second: 37,174,488
FPU dot products per second: 6,791,262

One question: would it be better if I loaded all the vectors onto the FPU stack and computed the dot product that way? My 3DNow! function is 5.47x faster than the FPU version and I would like to close the gap. I thought the FPU was pretty damn fast, yet it's only pumping out 6 million dot products a second compared to 37 million for my 3DNow! routine.

Here are the routines; please criticize them (constructively, that is :) ).




GetDotProduct PROC lpvect1:DWORD, lpvect2:DWORD
mov esi,lpvect1
mov edi,lpvect2
fld DWORD PTR [esi] ;Load X1 onto stack
fmul DWORD PTR [edi] ;ST(0)==X1*X2
fld DWORD PTR [esi+4] ;Load Y1
fmul DWORD PTR [edi+4] ;ST(0)=Y2*Y1
faddp st(1),st(0) ;ST(0)==X+y computed
;Now onto Z stuff
fld DWORD PTR [esi+8]
fmul DWORD PTR [edi+8]
faddp st(1),st(0)
;Result in ST(0)
ret
GetDotProduct endp

GetDP3DNow PROC lpv1:DWORD, lpv2:DWORD
LOCAL tempans:DWORD
mov esi,lpv1
mov edi,lpv2
movq mm0,[esi] ;Fetch X and Y of both vectors
movq mm1,[edi]
pfmul mm0,mm1
movd mm3,[esi+8]
pfacc mm0,mm0
movd mm5,[edi+8]
pfacc mm1,mm1
pfmul mm3,mm5
pfacc mm3,mm3
pfadd mm0,mm3
movd tempans,mm0
femms
fld tempans
ret

GetDP3DNow endp


Keep in mind that it only took me 5 minutes to come up with both functions together, so they are not optimized.
If anyone else is interested in running the program, let me know. Sorry, but it uses 3DNow!, so it will unfortunately crash on Intel machines.

I'm sorry but I'm extremely shocked :eek:
I didn't expect 3DNow! to be 5x faster on a T-Bird!!
Posted on 2002-12-29 16:08:43 by x86asm
Interesting...

Since I have an AMD Athlon now, I might give it a test in my new realtime (I hope) raytracer I just started to work on (yesterday) ... I have quite a lot of dot products there (and I am currently using the FPU to do them) ... however, I am not in the optimization phase now ...
:alright:
Posted on 2002-12-29 17:24:36 by BogdanOntanu

You may copy and alter my 3DNow! function if you wish to see how it helps in your raytracer :alright:
Posted on 2002-12-29 18:24:04 by x86asm
Another note: I just made a cross product function using the FPU (haven't cooked up a 3DNow! one yet).
Anyway, here is the code:


GetCrossProduct PROC lpveca:DWORD, lpvecb:DWORD, lpres:DWORD
LOCAL cleanup:DWORD
;Cross Product:
;(Ay*Bz-Az*By), (Az*Bx-Ax*Bz), (Ax*By-Ay*Bx)
mov esi,lpveca
mov edi,lpvecb
fld REAL4 PTR [esi+4] ;Load Ay onto stack
fmul REAL4 PTR [edi+8] ;ST(0)=Ay*Bz
fld REAL4 PTR [esi+8] ;Load Az onto stack
mov edx,lpres
fmul REAL4 PTR [edi+4] ;ST(0)=Az*By
;ST(1)=Ay*Bz
fxch ;Xchange registers
fsub st(0),st(1) ;Perform subtraction
fstp REAL4 PTR [edx] ;Store in RAM
add edx,4
fld REAL4 PTR [esi+8] ;Load Az onto stack
fmul REAL4 PTR [edi] ;ST(0)==Az*Bx
fld REAL4 PTR [esi] ;Load Ax onto stack
fmul REAL4 PTR [edi+8] ;St(0)==Ax*Bz
fxch ;Xchange
fsub st(0),st(1)
fstp REAL4 PTR [edx] ;Store result into Result space
add edx,4
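;Now proceed to last component of cross product
fld REAL4 PTR [esi] ;Load Ax onto stack
fmul REAL4 PTR [edi+4] ;ST(0)=Ax*By
fld REAL4 PTR [esi+4] ;Load Ay onto stack
fmul REAL4 PTR [edi] ;ST(0)=Ay*Bx, ST(1)=Ax*By
fxch ;ST(0)=Ax*By, ST(1)=Ay*Bx
fsub st(0),st(1) ;ST(0)=Ax*By-Ay*Bx
fstp REAL4 PTR [edx] ;Store third component
fcompp ;Pop two of the three leftover products
fstp cleanup ;Pop the last one so the FPU stack is empty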
ret

GetCrossProduct endp


I ran a test just now and my T-Bird 900 did:
18,649,285 cross products a second!!

The weird thing is what you see when you compare it to my FPU dot product function:
my cross product calculation is faster than my dot product.
HOW?!?!?!
This adds even more to my current state of shock.

:eek:
Posted on 2002-12-29 18:54:41 by x86asm
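One way to sanity-check a routine like GetCrossProduct is to compare its output against a plain C reference. The sketch below is only that: `cross_ref` and `dot_ref` are hypothetical helper names, not part of the posted program, and the arrays are assumed to be laid out as x, y, z to match the `[esi]`, `[esi+4]`, `[esi+8]` offsets.

```c
/* Reference cross product. v[0]=x, v[1]=y, v[2]=z (assumed layout,
   matching the +0/+4/+8 offsets used in the asm routine). */
void cross_ref(const float a[3], const float b[3], float out[3])
{
    out[0] = a[1] * b[2] - a[2] * b[1];   /* Ay*Bz - Az*By */
    out[1] = a[2] * b[0] - a[0] * b[2];   /* Az*Bx - Ax*Bz */
    out[2] = a[0] * b[1] - a[1] * b[0];   /* Ax*By - Ay*Bx */
}

/* Reference dot product, used here only as a cross-check. */
float dot_ref(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

A quick check that needs no debugger: the cross product is perpendicular to both inputs, so `dot_ref(a, c)` and `dot_ref(b, c)` should come out at (or very near) zero for `c = a x b`. If the third component is computed from the wrong offsets, that test fails for most inputs.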
What the?!?!

I just did a test run of the FPU dot Product and it jumped all the way up to:
46,723,552


:eek:

But when I test both 3DNow! and FPU together, they both drop to 20-something million each.
Weird.
Posted on 2002-12-29 19:02:14 by x86asm
Here are my results:
GetDotProduct:   1000 loops, 21014 cycles
GetDP3DNow:      1000 loops, 32271 cycles
GetCrossProduct: 1000 loops, 30014 cycles
These results are very reproducible and they include
the function call overhead (+ parameter passing).
Posted on 2002-12-29 19:39:45 by bitRAKE
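To put cycle counts like these next to the per-second figures earlier in the thread, cycles per call can be converted to calls per second. The sketch below assumes a 900 MHz clock like x86asm's Thunderbird. The post does not say which CPU the cycle counts were measured on, so that clock is purely an assumption, and `calls_per_second` is a made-up helper name.

```c
/* Convert "total_cycles for `loops` calls" into calls per second
   at a given clock rate in Hz. The clock rate is an assumption
   supplied by the caller, not something measured. */
double calls_per_second(double clock_hz, double total_cycles, double loops)
{
    return clock_hz * loops / total_cycles;
}
```

At an assumed 900 MHz, 21014 cycles per 1000 calls works out to roughly 42.8 million calls per second, 32271 cycles to roughly 27.9 million, and 30014 cycles to roughly 30.0 million.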
My 3DNow! routine seems to be slower than my FPU routine; can you make a guess as to what the problem may be?
Posted on 2002-12-29 20:05:51 by x86asm
Normally, in high-performance 3DNow! code, the 3DNow! instructions are scheduled apart from each other so as to avoid delays due to execution resource contention (while also taking into account dependencies and execution latencies).
There is a two-cycle latency - look at the forward dependencies! Two dot products could be executed in parallel in the same number of cycles!
GetDP3DNow PROC lpv1:DWORD, lpv2:DWORD

LOCAL tempans:DWORD
mov esi,lpv1
mov edi,lpv2
movq mm0,[esi] ;Fetch X and Y of both vectors
movq mm1,[edi]
pfmul mm0,mm1
movd mm3,[esi+8]
pfacc mm0,mm0
movd mm5,[edi+8]
pfacc mm1,mm1 ; What is this for?
pfmul mm3,mm5
pfacc mm3,mm3 ; Are you sure this is correct?
pfadd mm0,mm3
movd tempans,mm0
femms
fld tempans
ret
GetDP3DNow endp
Posted on 2002-12-29 23:15:42 by bitRAKE
The pfacc's are there to accumulate the two REAL4s in each register. Yes, they are correct; they work. Are they slowing it down? How would you do it?
Posted on 2002-12-30 08:00:00 by x86asm
Is it a good idea to accumulate the register into another 3DNow! register to eliminate dependencies?
Posted on 2002-12-30 08:02:20 by x86asm
MM1 is never used again after PFACC - how can that be right? :)
Please, take a look in a debugger with some test values.
Posted on 2002-12-30 11:11:21 by bitRAKE
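The "try it with test values" suggestion can also be followed on paper. Below is a rough C model of the lane arithmetic in the GetDP3DNow routine as originally posted. It is only a sketch: `mm_t` and the helper names are hypothetical stand-ins for the MMX registers and instructions, and it models results only, not latency or scheduling.

```c
/* One 64-bit MMX register viewed as two packed floats. */
typedef struct { float lo, hi; } mm_t;

/* movq mm,[mem]: load two consecutive floats. */
static mm_t movq_load(const float *p) { mm_t r = { p[0], p[1] }; return r; }
/* movd mm,[mem]: load one float into the low lane, zero the high lane. */
static mm_t movd_load(const float *p) { mm_t r = { p[0], 0.0f }; return r; }
/* pfmul dst,src: lane-wise multiply. */
static mm_t pfmul(mm_t d, mm_t s) { mm_t r = { d.lo * s.lo, d.hi * s.hi }; return r; }
/* pfadd dst,src: lane-wise add. */
static mm_t pfadd(mm_t d, mm_t s) { mm_t r = { d.lo + s.lo, d.hi + s.hi }; return r; }
/* pfacc dst,src: low = dst.lo + dst.hi, high = src.lo + src.hi. */
static mm_t pfacc(mm_t d, mm_t s) { mm_t r = { d.lo + d.hi, s.lo + s.hi }; return r; }

/* The routine as first posted, instruction by instruction. */
float dp3dnow_as_posted(const float *v1, const float *v2)
{
    mm_t mm0 = movq_load(v1);      /* {x1, y1} */
    mm_t mm1 = movq_load(v2);      /* {x2, y2} */
    mm0 = pfmul(mm0, mm1);         /* {x1*x2, y1*y2} */
    mm_t mm3 = movd_load(v1 + 2);  /* {z1, 0} */
    mm0 = pfacc(mm0, mm0);         /* both lanes = x1*x2 + y1*y2 */
    mm_t mm5 = movd_load(v2 + 2);  /* {z2, 0} */
    mm1 = pfacc(mm1, mm1);         /* x2 + y2: computed but never read */
    mm3 = pfmul(mm3, mm5);         /* {z1*z2, 0} */
    mm3 = pfacc(mm3, mm3);         /* z1*z2 + 0: value unchanged in low lane */
    mm0 = pfadd(mm0, mm3);
    (void)mm1;                     /* dead result */
    return mm0.lo;                 /* movd tempans,mm0 */
}

/* Same routine with the two pfacc instructions on mm1 and mm3 dropped. */
float dp3dnow_trimmed(const float *v1, const float *v2)
{
    mm_t mm0 = pfmul(movq_load(v1), movq_load(v2));
    mm_t mm3 = pfmul(movd_load(v1 + 2), movd_load(v2 + 2));
    mm0 = pfacc(mm0, mm0);
    mm0 = pfadd(mm0, mm3);
    return mm0.lo;
}
```

With v1 = (1, 2, 3) and v2 = (4, 5, 6), both versions return 32. In this model the `pfacc mm1,mm1` result is never consumed and `pfacc mm3,mm3` only adds a zero, so neither affects the answer; they just occupy execution slots.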
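OK, I did something wrong. Here is the fixed one (somewhat):

GetDP3DNow PROC lpv1:DWORD, lpv2:DWORD
LOCAL tempans:DWORD
mov esi,lpv1
mov edi,lpv2
movq mm0,[esi] ;Fetch X1 and Y1
movq mm1,[edi] ;Fetch X2 and Y2
pfmul mm0,mm1 ;mm0 = X1*X2 | Y1*Y2
movd mm3,[esi+8] ;Fetch Z1 (upper half is zeroed)
pfacc mm0,mm0 ;mm0 low = X1*X2 + Y1*Y2
movd mm5,[edi+8] ;Fetch Z2
pfmul mm3,mm5 ;mm3 low = Z1*Z2
pfadd mm0,mm3 ;mm0 low = full dot product
movd tempans,mm0
femms
fld tempans ;Return result in ST(0)
ret
GetDP3DNow endp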


Here are the vectors I used
vec1 dd 0.984253f, 0.174552f, -0.027857f
vec2 dd -0.131588f, 0.618331f,-0.774823f

Here are the results displayed by OllyDBG:
FPU: -7.2829425334930419920e-07
3DNow!: -7.234874174001010380e-07

Is that accuracy acceptable? And why the difference, is it the precision?
Posted on 2002-12-30 12:07:46 by x86asm
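On the accuracy question: the two debugger values differ by about 4.8e-9. That looks big next to a result of about -7.3e-7, but these two vectors are almost perpendicular, so the three products (the largest is about 0.13 in magnitude) nearly cancel. A single-precision rounding step on a value near 0.13 is already on the order of 1e-8, so a gap of this size is what you would expect when one path rounds every intermediate to single precision and the other keeps wider intermediates. The sketch below compares a single-precision and a double-precision dot product on the same inputs; the function names are hypothetical, and it does not reproduce either the x87 or the 3DNow! rounding behaviour exactly.

```c
/* Dot product with every intermediate held in single precision. */
float dot_f32(const float a[3], const float b[3])
{
    float s = a[0] * b[0];
    s += a[1] * b[1];
    s += a[2] * b[2];
    return s;
}

/* Same inputs, but products and sums carried in double precision. */
double dot_f64(const float a[3], const float b[3])
{
    return (double)a[0] * b[0] + (double)a[1] * b[1] + (double)a[2] * b[2];
}

/* Magnitude of the largest single product: the scale that rounding
   error should be judged against, rather than the (tiny) final sum. */
double largest_term(const float a[3], const float b[3])
{
    double m = 0.0;
    for (int i = 0; i < 3; ++i) {
        double t = (double)a[i] * b[i];
        if (t < 0.0) t = -t;
        if (t > m) m = t;
    }
    return m;
}
```

Comparing `dot_f32(v) - dot_f64(v)` against `largest_term(v)` rather than against the result itself gives a fairer picture: relative to the terms being summed, the disagreement is around one part in ten million or better, which is about all single precision can offer.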
LOL! Sorry bitRAKE, I felt stupid :o
Posted on 2002-12-30 12:08:28 by x86asm