Hey guys, look what I found out:

I wrote a function to calculate the dot product of two vectors, plus a sample benchmark program with one version for the FPU and one for 3DNow! (no SSE, because my Thunderbird doesn't support it).

Anyway, here are my results:

AMD Thunderbird 900 MHz, Socket A, 133 MHz SDRAM

3DNow! dot products per second: 37,174,488

FPU dot products per second: 6,791,262

One question: would it be better if I loaded all the vectors on the stack and computed the dot product that way? My 3DNow! function is 5.47x faster than the FPU version, and I would like to close the gap. I thought the FPU was pretty damn fast, yet it's only pumping out 6.8 million dot products a second compared to 37 million for my 3DNow! routine.

Here are the routines. Please criticize them (constructively, that is :) ).

```
GetDotProduct PROC lpvect1:DWORD, lpvect2:DWORD
    mov esi,lpvect1
    mov edi,lpvect2
    fld DWORD PTR [esi]      ;Load X1 onto stack
    fmul DWORD PTR [edi]     ;ST(0)=X1*X2
    fld DWORD PTR [esi+4]    ;Load Y1
    fmul DWORD PTR [edi+4]   ;ST(0)=Y1*Y2
    faddp st(1),st(0)        ;ST(0)=X1*X2+Y1*Y2
    ;Now onto Z stuff
    fld DWORD PTR [esi+8]
    fmul DWORD PTR [edi+8]
    faddp st(1),st(0)
    ;Result in ST(0)
    ret
GetDotProduct endp

GetDP3DNow PROC lpv1:DWORD, lpv2:DWORD
    LOCAL tempans:DWORD
    mov esi,lpv1
    mov edi,lpv2
    movq mm0,[esi]           ;Fetch X and Y of both vectors
    movq mm1,[edi]
    pfmul mm0,mm1
    movd mm3,[esi+8]
    pfacc mm0,mm0
    movd mm5,[edi+8]
    pfacc mm1,mm1
    pfmul mm3,mm5
    pfacc mm3,mm3
    pfadd mm0,mm3
    movd tempans,mm0
    femms
    fld tempans
    ret
GetDP3DNow endp
```

Keep in mind that it only took me 5 minutes to come up with both functions, so they are not optimized.

If anyone else is interested in running the program, let me know. Sorry, but it uses 3DNow!, so unfortunately it will crash on Intel machines.

I'm sorry, but I'm extremely shocked :eek:

I didn't expect 3DNow! to be 5x faster on a TBIRD!!

Interesting...

Since I have an AMD Athlon now, I might give it a test in my new realtime (I hope) raytracer, which I just started working on yesterday. I have quite a lot of dot products there (and I am currently using the FPU to do them). However, I am not in the optimization phase yet.

:alright:

You may copy and alter my 3DNow! function if you wish, to see how it helps in your raytracer :alright:

Another note: I just made a cross product function using the FPU (I haven't cooked up a 3DNow! one yet).

Anyway here is the code:

```
GetCrossProduct PROC lpveca:DWORD, lpvecb:DWORD, lpres:DWORD
    LOCAL cleanup:DWORD
    ;Cross Product:
    ;(Ay*Bz-Az*By), (Az*Bx-Ax*Bz), (Ax*By-Ay*Bx)
    mov esi,lpveca
    mov edi,lpvecb
    fld REAL4 PTR [esi+4]    ;Load Ay onto stack
    fmul REAL4 PTR [edi+8]   ;ST(0)=Ay*Bz
    fld REAL4 PTR [esi+8]    ;Load Az onto stack
    mov edx,lpres
    fmul REAL4 PTR [edi+4]   ;ST(0)=Az*By, ST(1)=Ay*Bz
    fxch                     ;ST(0)=Ay*Bz, ST(1)=Az*By
    fsub st(0),st(1)         ;ST(0)=Ay*Bz-Az*By
    fstp REAL4 PTR [edx]     ;Store X of result; Az*By stays on the stack
    add edx,4
    fld REAL4 PTR [esi+8]    ;Load Az onto stack
    fmul REAL4 PTR [edi]     ;ST(0)=Az*Bx
    fld REAL4 PTR [esi]      ;Load Ax onto stack
    fmul REAL4 PTR [edi+8]   ;ST(0)=Ax*Bz, ST(1)=Az*Bx
    fxch                     ;ST(0)=Az*Bx, ST(1)=Ax*Bz
    fsub st(0),st(1)         ;ST(0)=Az*Bx-Ax*Bz
    fstp REAL4 PTR [edx]     ;Store Y of result; Ax*Bz stays on the stack
    add edx,4
    ;Now proceed to last component of cross product
    fld REAL4 PTR [esi]      ;Load Ax onto stack
    fmul REAL4 PTR [edi+4]   ;ST(0)=Ax*By
    fld REAL4 PTR [esi+4]    ;Load Ay onto stack
    fmul REAL4 PTR [edi]     ;ST(0)=Ay*Bx, ST(1)=Ax*By
    fxch                     ;ST(0)=Ax*By, ST(1)=Ay*Bx
    fsub st(0),st(1)         ;ST(0)=Ax*By-Ay*Bx
    fstp REAL4 PTR [edx]     ;Store Z of result
    ;Pop the three leftover products off the FPU stack
    fcompp
    fstp cleanup
    ret
GetCrossProduct endp
```

I ran a test just now and my TBIRD 900 did:

18,649,285 cross products a second!!

The weird thing is what happens when you compare it to my FPU dot product function (6,791,262 per second):

My cross product calculation is faster than my dot product.

HOW?!?!?!

This adds even more to my current state of shock.

:eek:
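For checking what the assembly writes to the result buffer, a plain C reference can help. This is only a minimal sketch, assuming the same three-float x, y, z layout that the routines read at offsets 0, 4 and 8; the `Vec3` type and the `cross3` name are made up for illustration and are not part of the posted program:

```c
/* Hypothetical helper type: three packed floats, matching the [x, y, z]
   layout that the assembly routines read at offsets 0, 4 and 8. */
typedef struct { float x, y, z; } Vec3;

/* Reference cross product:
   (Ay*Bz - Az*By, Az*Bx - Ax*Bz, Ax*By - Ay*Bx). */
static Vec3 cross3(Vec3 a, Vec3 b)
{
    Vec3 r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}
```

Feeding it (1,0,0) and (0,1,0) should give (0,0,1). A unit-vector pair like that is a handy test case because the third component depends only on the x and y offsets, so a wrong offset there shows up immediately.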

What the?!?!

I just did a test run of the FPU dot product on its own and it jumped all the way up to:

46,723,552

:eek:

But when I test both 3DNow! and FPU in the same run, they both drop to 20-something million each.

Weird.

Here are my results:

```
GetDotProduct:   1000 loops, 21014 cycles
GetDP3DNow:      1000 loops, 32271 cycles
GetCrossProduct: 1000 loops, 30014 cycles
```

These results are very reproducible, and they include the function call overhead (plus parameter passing).
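To compare these cycle counts with the per-second figures earlier in the thread, the conversion is just clock rate divided by cycles per call. A rough sketch in C, assuming a 900 MHz clock like the Thunderbird mentioned above (the machine that produced the cycle counts is not stated, and `calls_per_second` is a name made up for this example):

```c
/* Convert a measured cycle count into an estimated call rate.
   clock_hz is an assumption supplied by the caller; the thread does not
   say which CPU produced the cycle counts. */
static double calls_per_second(double clock_hz, double cycles, double loops)
{
    double cycles_per_call = cycles / loops;
    return clock_hz / cycles_per_call;
}
```

At 900e6 Hz, 21014 cycles over 1000 loops is about 21 cycles per call, or roughly 43 million calls per second. That is in the same range as the 46,723,552 figure from the standalone FPU run.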

My 3DNow! routine seems to be slower than my FPU routine. Can you make a guess as to what the problem may be?

Normally, in high-performance 3DNow! code, all of the 3DNow! instructions are properly scheduled apart from each other so as to avoid delays due to execution resource contentions (as well as taking into account dependencies and execution latencies).

There is a two cycle latency - look at the forward dependencies! Two dot products could be executed in parallel in the same number of cycles!

```
GetDP3DNow PROC lpv1:DWORD, lpv2:DWORD
    LOCAL tempans:DWORD
    mov esi,lpv1
    mov edi,lpv2
    movq mm0,[esi]           ;Fetch X and Y of both vectors
    movq mm1,[edi]
    pfmul mm0,mm1
    movd mm3,[esi+8]
    pfacc mm0,mm0
    movd mm5,[edi+8]
    pfacc mm1,mm1            ; What is this for?
    pfmul mm3,mm5
    pfacc mm3,mm3            ; Are you sure this is correct?
    pfadd mm0,mm3
    movd tempans,mm0
    femms
    fld tempans
    ret
GetDP3DNow endp
```

The pfacc's are there to accumulate the two REAL4's in each register. Yes, they are correct; they work. Are they slowing it down? How would you do it?

Is it a good idea to accumulate the register into another 3DNow! register to eliminate dependencies?

MM1 is never used again after PFACC - how can that be right? :)

Please, take a look in a debugger with some test values.
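If a debugger is not handy, the lane arithmetic can also be modelled in plain C. This is only a sketch of the documented PFMUL/PFACC/PFADD behaviour on a pair of floats, not real 3DNow! code; the `Pair` type and the `pfmul`, `pfadd`, `pfacc` and `dot3_model` names are invented for the example:

```c
/* Software model of one MMX register holding two packed floats. */
typedef struct { float lo, hi; } Pair;

/* PFMUL dst,src: multiply lane by lane. */
static Pair pfmul(Pair d, Pair s)
{
    Pair r = { d.lo * s.lo, d.hi * s.hi };
    return r;
}

/* PFADD dst,src: add lane by lane. */
static Pair pfadd(Pair d, Pair s)
{
    Pair r = { d.lo + s.lo, d.hi + s.hi };
    return r;
}

/* PFACC dst,src: low lane = dst.lo + dst.hi, high lane = src.lo + src.hi. */
static Pair pfacc(Pair d, Pair s)
{
    Pair r = { d.lo + d.hi, s.lo + s.hi };
    return r;
}

/* Mirrors the posted GetDP3DNow sequence, minus the pfacc mm1,mm1 line. */
static float dot3_model(const float *a, const float *b)
{
    Pair mm0 = { a[0], a[1] };   /* movq mm0,[esi] */
    Pair mm1 = { b[0], b[1] };   /* movq mm1,[edi] */
    Pair mm3 = { a[2], 0.0f };   /* movd mm3,[esi+8] zeroes the high lane */
    Pair mm5 = { b[2], 0.0f };   /* movd mm5,[edi+8] */

    mm0 = pfmul(mm0, mm1);       /* (x1*x2, y1*y2) */
    mm0 = pfacc(mm0, mm0);       /* both lanes = x1*x2 + y1*y2 */
    mm3 = pfmul(mm3, mm5);       /* (z1*z2, 0) */
    mm3 = pfacc(mm3, mm3);       /* low lane = z1*z2 + 0, unchanged */
    mm0 = pfadd(mm0, mm3);       /* low lane = full dot product */
    return mm0.lo;               /* movd tempans,mm0 */
}
```

In this model the `pfacc` on `mm3` leaves the low lane as it was, and `mm1` is never read after the multiply, so a `pfacc` on `mm1` cannot change the result either way.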

OK, I did something wrong. Here is the fixed one (somewhat):

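```
GetDP3DNow PROC lpv1:DWORD, lpv2:DWORD
    LOCAL tempans:DWORD
    mov esi,lpv1
    mov edi,lpv2
    movq mm0,[esi]           ;Fetch X and Y of both vectors
    movq mm1,[edi]
    pfmul mm0,mm1            ;mm0 = (X1*X2, Y1*Y2)
    movd mm3,[esi+8]
    pfacc mm0,mm0            ;mm0 low = X1*X2+Y1*Y2
    movd mm5,[edi+8]
    pfmul mm3,mm5            ;mm3 low = Z1*Z2
    pfadd mm0,mm3            ;mm0 low = dot product
    movd tempans,mm0
    femms
    fld tempans
    ret
GetDP3DNow endp
```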

Here are the vectors I used:

vec1 dd 0.984253f, 0.174552f, -0.027857f

vec2 dd -0.131588f, 0.618331f, -0.774823f

Here are the results displayed by OllyDbg:

FPU: -7.2829425334930419920e-07

3DNow!: -7.234874174001010380e-07

Is that accuracy acceptable? Is the difference down to precision?
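On the accuracy question: the three products here are each between about 0.02 and 0.13 in magnitude but nearly cancel, so the final sum sits close to the rounding error of single precision. A small C sketch that makes this visible with the vectors above (the function names are invented for the example, and the double-precision version is only a stand-in for the FPU's wider intermediates, not a model of it):

```c
/* The test vectors from the post. */
static const float vec1[3] = { 0.984253f, 0.174552f, -0.027857f };
static const float vec2[3] = { -0.131588f, 0.618331f, -0.774823f };

/* Dot product with float intermediates. This assumes the compiler
   evaluates float expressions in float, which is the usual default
   on x86-64 but is not guaranteed everywhere. */
static float dot3_float(const float *a, const float *b)
{
    float xx = a[0] * b[0];
    float yy = a[1] * b[1];
    float zz = a[2] * b[2];
    float s = xx + yy;
    return s + zz;
}

/* Same sum with double intermediates. */
static double dot3_double(const float *a, const float *b)
{
    return (double)a[0] * b[0] + (double)a[1] * b[1] + (double)a[2] * b[2];
}

/* Absolute gap between the two results. */
static double dot3_gap(const float *a, const float *b)
{
    double g = (double)dot3_float(a, b) - dot3_double(a, b);
    return g < 0.0 ? -g : g;
}
```

With inputs like these, the float-only sum can differ from the double one by around a percent of the result, which is about the size of the gap between the two debugger values. A difference like that is what single-precision rounding would be expected to produce, rather than a sign of a bug in either routine.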

LOL! Sorry BitRake I felt stupid :o