Hi all :)

Does anyone have (or know where I can get) a 4x4 matrix multiply routine optimized for speed? If not, could you give me any advice on how to make one myself?

Thanks in advance :)
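For anyone following along, here is the operation in question as a plain, unoptimized C reference - handy as a correctness baseline to check any optimized routine against. Row-major storage is assumed, and the function name is mine:

```c
#include <string.h>

/* Reference 4x4 matrix multiply, row-major storage: out = a * b.
   Deliberately naive -- use it to verify optimized routines. */
void mat4_mul(const float a[16], const float b[16], float out[16])
{
    float tmp[16];                      /* lets out alias a or b */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i * 4 + k] * b[k * 4 + j];
            tmp[i * 4 + j] = s;
        }
    memcpy(out, tmp, sizeof tmp);
}
```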


This may be of help:

http://nebuladevice.cubik.org/documentation/nebula2/__matrix44__sse_8h-source.shtml


Hi QvasiModo

Check Demo11 of the ObjAsm32 package. NaN has written general-purpose routines to handle matrix operations, and they are quite optimized... :)

Regards,

Biterider


I have not looked at NaN's code - is it MMX?

I use the (rather generic) FPU macros from Caleb's DX8 includes, a file by the name of "d3dx8math_fkt.def".

It includes a bunch of very handy FPU-related macros, including 3x3 and 4x4 matrix operations... worth a look :)


Hi EvilHomer2k

The matrix.inc code was developed for real8 (double) floats, so MMX technology is not suitable.

MMX is fine for fast, not-so-accurate computations like those used in the DirectX stuff...

While we were writing the code, we focused on general-purpose code that would be as fast as possible. That's why the LU computation was added.

Depending on the programming target, one way will be better than the other. :)

Regards,

Biterider
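The real4/real8 trade-off Biterider mentions is easy to demonstrate outside of matrix.inc: summing 0.1 a million times in single precision drifts badly, while double precision stays essentially exact. A small illustration of my own (not from the package):

```c
#include <math.h>

/* real4 (float) vs real8 (double) accumulation error:
   add 0.1 one million times and compare to the exact 100000. */
double sum_error_real4(void)
{
    float s = 0.0f;
    for (int i = 0; i < 1000000; i++) s += 0.1f;
    return fabs((double)s - 100000.0);   /* drifts by hundreds */
}

double sum_error_real8(void)
{
    double s = 0.0;
    for (int i = 0; i < 1000000; i++) s += 0.1;
    return fabs(s - 100000.0);           /* tiny residual error */
}
```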


Matrix code for real8? Nice. DirectX only uses real4, and the same goes for OGL; to the best of my knowledge this is standard... Caleb's macros are all real4, but plainly coded and easy to modify for real8 - I assume NaN's code looks pretty similar to Caleb's..

If you'll be dealing with particular types of matrices, then specialist inversion routines most probably exist and will almost certainly be faster.

For example, 4x4 matrices tend to be used in 3D math (hence OGL and DX being mentioned). If that's the area you're working in, and your matrices are affine - meaning they're made up of combinations of translation, rotation and/or scaling matrices - then you can use an affine matrix inversion algo. C++ code can be found at the bottom of http://cvs.sourceforge.net/viewcvs.py/tenebrae/tenebrae_2/mathlib.c?rev=1.3
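The affine shortcut can be sketched in C for the rigid case (rotation + translation only, no scaling): the inverse of the rotation block is just its transpose, and the inverse translation is -R^T * t. This is only the idea, not the Tenebrae routine itself; row-major storage with the translation in the last column is assumed:

```c
/* Rigid-transform inverse: for M = [R | t; 0 0 0 1] with R a pure
   rotation, inv(M) = [R^T | -R^T*t; 0 0 0 1].  Row-major float[16],
   translation in m[3], m[7], m[11]. */
void rigid_inverse(const float m[16], float out[16])
{
    /* transpose the 3x3 rotation block */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            out[i * 4 + j] = m[j * 4 + i];

    /* t' = -R^T * t */
    for (int i = 0; i < 3; i++)
        out[i * 4 + 3] = -(out[i * 4 + 0] * m[3] +
                           out[i * 4 + 1] * m[7] +
                           out[i * 4 + 2] * m[11]);

    /* bottom row stays (0, 0, 0, 1) */
    out[12] = out[13] = out[14] = 0.0f;
    out[15] = 1.0f;
}
```

With scaling present the 3x3 block needs a proper inverse instead of a transpose, which is what the full affine routine in mathlib.c handles.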

Oh, and btw, ignore the fact that the code is under GPL - the bit where it says


*"Thnx to http://www.cs.unc.edu/~gotz/code/affinverse.html"* neglects to mention that the original code was completely free. I also have an implementation of this in FPU code, but it's at home :(

Thanks a lot for your replies! :)

@Sparafusile: Very interesting! :) I'll try to put that code in my benchmark program...

@Biterider & Eóin: I'll check it out... :alright:


This is what I have so far, tested on Pentium III...

```
Library: Matrix.dll
Procedure: MatrixMultiply
Repeat: 1000000 time(s)
Total time: 174282944 clock ticks, 78 ms.
Each run: 174 clock ticks

Library: pdixon.dll
Procedure: pdixon1
Repeat: 1000000 time(s)
Total time: 3225470300 clock ticks, 1343 ms.
Each run: 4294966227 clock ticks

Library: mlib.dll
Procedure: Mat_Mul_4x4_4x4
Repeat: 1000000 time(s)
Total time: 460865244 clock ticks, 188 ms.
Each run: 461 clock ticks

Library: matasm.dll
Procedure: Mat_Mult
Repeat: 1000000 time(s)
Total time: 169537948 clock ticks, 62 ms.
Each run: 170 clock ticks
```

I find it interesting that with the pdixon.dll one run took more clocks than the combination of all loops. That's an interesting trick :)

Btw, I'm using a P4.

Spara
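That pdixon.dll "Each run: 4294966227" figure looks like a 32-bit wraparound rather than a real measurement: 4294966227 = 2^32 - 1069, i.e. a small negative value (probably the timing overhead subtracted from a shorter measured time) stored in an unsigned 32-bit variable. A quick check of the arithmetic:

```c
#include <stdint.h>

/* What a negative per-run count looks like after being stored in an
   unsigned 32-bit variable: two's-complement reinterpretation. */
uint32_t wrap32(int32_t x)
{
    return (uint32_t)x;
}
```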


Oops! :oops:

I'll check that out...

Now it's working correctly. :)

```
Library: Matrix.dll
Procedure: MatrixMultiply
Repeat: 10000000 time(s)
Total time: 2318528161 clock ticks, 3559 ms.
Each run: 232 clock ticks

Library: pdixon.dll
Procedure: pdixon1
Repeat: 10000000 time(s)
Total time: 5348376136 clock ticks, 8211 ms.
Each run: 535 clock ticks

Library: mlib.dll
Procedure: Mat_Mul_4x4_4x4
Repeat: 10000000 time(s)
Total time: 4275011031 clock ticks, 6560 ms.
Each run: 428 clock ticks

Library: matasm.dll
Procedure: Mat_Mult
Repeat: 10000000 time(s)
Total time: 1684109204 clock ticks, 2584 ms.
Each run: 168 clock ticks
```

I must post a matrix routine :)


Here is a matrix multiply routine that multiplies a 4x4 matrix by a 4x1 vector, where both the matrix and vector elements are single precision floating-point numbers. This routine is probably close to optimal on a Pentium.

```
; 4x4 by 4x1 matrix multiply; executes in 58 cycles
;
; input:    esi = pointer to original 4x1 vector
;           ebx = pointer to 4x4 transformation matrix
; output:   edi = pointer to transformed 4x1 vector
; destroys: fp stack

MACRO F4x4M
fld [dword ptr esi]     ; x
fld st(0)               ; x, x
fmul [dword ptr ebx]    ; x*a11, x
fld st(1)               ; x, x*a11, x
fmul [dword ptr ebx+16] ; x*a12, x*a11, x
fld st(2)               ; x, x*a12, x*a11, x
fmul [dword ptr ebx+32] ; x*a13, x*a12, x*a11, x
fxch st(3)              ; x, x*a12, x*a11, x*a13
fmul [dword ptr ebx+48] ; x*a14, x*a12, x*a11, x*a13
fld [dword ptr esi+4]   ; y, x*a14, x*a12, x*a11, x*a13
fld st(0)               ; y, y, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+4]  ; y*a21, y, x*a14, x*a12, x*a11, x*a13
fld st(1)               ; y, y*a21, y, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+20] ; y*a22, y*a21, y, x*a14, x*a12, x*a11, x*a13
fld st(2)               ; y, y*a22, y*a21, y, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+36] ; y*a23, y*a22, y*a21, y, x*a14, x*a12, x*a11, x*a13
fxch st(3)              ; y, y*a22, y*a21, y*a23, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+52] ; y*a24, y*a22, y*a21, y*a23, x*a14, x*a12, x*a11, x*a13
fxch st(2)              ; y*a21, y*a22, y*a24, y*a23, x*a14, x*a12, x*a11, x*a13
faddp st(6),st          ; y*a22, y*a24, y*a23, x*a14, x*a12, x*a11+y*a21, x*a13
faddp st(4),st          ; y*a24, y*a23, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13
fxch st(1)              ; y*a23, y*a24, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13
faddp st(5),st          ; y*a24, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld [dword ptr esi+8]   ; z, y*a24, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fxch st(1)              ; y*a24, z, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
faddp st(2),st          ; z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld st(0)               ; z, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+8]  ; z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld st(1)               ; z, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+24] ; z*a32, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld st(2)               ; z, z*a32, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+40] ; z*a33, z*a32, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fxch st(3)              ; z, z*a32, z*a31, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+56] ; z*a34, z*a32, z*a31, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fxch st(2)              ; z*a31, z*a32, z*a34, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
faddp st(6),st          ; z*a32, z*a34, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21+z*a31, x*a13+y*a23
faddp st(4),st          ; z*a34, z*a33, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23
fxch st(1)              ; z*a33, z*a34, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23
faddp st(5),st          ; z*a34, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld [dword ptr esi+12]  ; w, z*a34, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fxch st(1)              ; z*a34, w, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
faddp st(2),st          ; w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld st(0)               ; w, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+12] ; w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld st(1)               ; w, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+28] ; w*a42, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld st(2)               ; w, w*a42, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+44] ; w*a43, w*a42, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fxch st(3)              ; w, w*a42, w*a41, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+60] ; w*a44, w*a42, w*a41, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fxch st(2)              ; w*a41, w*a42, w*a44, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
faddp st(6),st          ; w*a42, w*a44, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33
faddp st(4),st          ; w*a44, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33
fxch st(1)              ; w*a43, w*a44, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33
faddp st(5),st          ; w*a44, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33+w*a43
fxch st(3)              ; x*a11+y*a21+z*a31+w*a41, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, w*a44, x*a13+y*a23+z*a33+w*a43
fstp [dword ptr edi]    ; x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, w*a44, x*a13+y*a23+z*a33+w*a43
faddp st(2),st          ; x*a12+y*a22+z*a32+w*a42, x*a14+y*a24+z*a34+w*a44, x*a13+y*a23+z*a33+w*a43
fstp [dword ptr edi+4]  ; x*a14+y*a24+z*a34+w*a44, x*a13+y*a23+z*a33+w*a43
fxch st(1)              ; x*a13+y*a23+z*a33+w*a43, x*a14+y*a24+z*a34+w*a44
fstp [dword ptr edi+8]  ; x*a14+y*a24+z*a34+w*a44
fstp [dword ptr edi+12] ;
ENDM
```
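For reference, here is what the macro computes, written out in C (the function name is mine). Reading the offsets - a11 at [ebx], a21 at [ebx+4], a12 at [ebx+16] - element a_rc sits at byte offset (c-1)*16 + (r-1)*4, so on the flat float array the result is out[j] = sum over k of m[j*4+k]*v[k]; with the comments' a_rc naming that is the row-vector product v*A, the convention DirectX uses:

```c
/* C equivalent of F4x4M: out[j] = dot(v, j-th group of four floats),
   i.e. out[j] = x*a1(j+1) + y*a2(j+1) + z*a3(j+1) + w*a4(j+1)
   in the macro's element naming. */
void f4x4m_ref(const float v[4], const float m[16], float out[4])
{
    for (int j = 0; j < 4; j++)
        out[j] = m[j * 4 + 0] * v[0] + m[j * 4 + 1] * v[1] +
                 m[j * 4 + 2] * v[2] + m[j * 4 + 3] * v[3];
}
```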


Thanks, I'll try it right away :)

QvasiModo,

Just a question: are you multiplying a 4x4 matrix by a 4x4 matrix?

BTW, sorry I did not have time to try to code an SSE version.


Yes, it's a 4x4 matrix by another 4x4, so I couldn't test **>Matrix<**'s code, sorry :(

Here's a new version, and it seems I've screwed up the code, because they're not producing the same results. Probably some of these functions use a different representation for the matrix. :?:

I'll look into it tomorrow...
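A frequent reason different 4x4 routines disagree is exactly that representation issue: the same 16 floats read as row-major by one routine and column-major by another. In fact, a column-major multiply applied to arrays that were filled row-major computes precisely the product with the operands swapped - so both routines can be individually correct and still disagree. A C sketch (helper names are mine):

```c
/* The same flat float[16] arrays multiplied under two storage
   conventions.  Row-major: element (i,j) at index i*4+j.
   Column-major: element (i,j) at index j*4+i. */
void mul_rowmajor(const float a[16], const float b[16], float c[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            c[i * 4 + j] = 0.0f;
            for (int k = 0; k < 4; k++)
                c[i * 4 + j] += a[i * 4 + k] * b[k * 4 + j];
        }
}

void mul_colmajor(const float a[16], const float b[16], float c[16])
{
    for (int i = 0; i < 4; i++)          /* row of the result    */
        for (int j = 0; j < 4; j++) {    /* column of the result */
            c[j * 4 + i] = 0.0f;
            for (int k = 0; k < 4; k++)
                c[j * 4 + i] += a[k * 4 + i] * b[j * 4 + k];
        }
}
```

On identical input arrays, mul_colmajor(a, b, ...) produces the same 16 floats as mul_rowmajor(b, a, ...), which is worth checking before assuming a routine is buggy.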