Hi all :)

Does anyone have (or know where I can get) a 4x4 matrix multiply routine optimized for speed? If not, could you give me any advice on how to write one myself?
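For reference while reading the replies: the straightforward, unoptimized version of the routine being asked about, sketched in C with row-major single-precision matrices (the function name is mine):

```c
/* Naive row-major 4x4 single-precision multiply: out = a * b.
   16 dot products, 64 multiplies and 48 adds; this is the baseline
   an optimized FPU/MMX/SSE routine is measured against. */
static void mat4_mul(float out[16], const float a[16], const float b[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i*4 + k] * b[k*4 + j];
            out[i*4 + j] = s;
        }
}
```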

Thanks in advance :)
Posted on 2004-11-01 10:49:31 by QvasiModo
Posted on 2004-11-01 12:21:06 by Sparafusile
Hi QvasiModo
Check Demo11 of the ObjAsm32 package. NaN has written general-purpose routines to handle matrix operations, and they are quite optimized... :)

Regards,

Biterider
Posted on 2004-11-02 00:17:57 by Biterider
I have not looked at NaN's code - is it MMX?
I use the (rather generic) FPU macros from Caleb's DX8 includes, a file by the name of "d3dx8math_fkt.def".
It includes a bunch of very handy FPU-related macros, including 3x3 and 4x4 matrix operations... worth a look :)
Posted on 2004-11-03 02:08:15 by Homer
Hi EvilHomer2k
The matrix.inc code was developed for real8 (double) floats, so MMX technology is not suitable.
MMX is fine for fast but less accurate computations like those used in the DirectX stuff...
While we were writing the code, we focused on general-purpose code that would be as fast as possible. That's why the LU computation was added.
Depending on the programming target, one way will be better than the other. :)

Regards,

Biterider
Posted on 2004-11-03 03:41:39 by Biterider
Matrix code for real8? Nice. DirectX only uses real4, and the same goes for OGL; to the best of my knowledge that's standard... Caleb's macros are all real4, but plainly coded and easy to modify for real8. I assume NaN's code looks pretty similar to Caleb's...
Posted on 2004-11-03 06:29:56 by Homer
If you'll be dealing with particular types of matrices then specialist inversion routines most probably exist and will almost certainly be faster.

For example, 4x4 matrices tend to be used in 3D math (hence OGL and DX being mentioned). If that's the area you're working in, and your matrices are affine (made up of combinations of translation, rotation and/or scaling matrices), then you can use an affine matrix inversion algo. C++ code can be found at the bottom of http://cvs.sourceforge.net/viewcvs.py/tenebrae/tenebrae_2/mathlib.c?rev=1.3

Oh, and btw, ignore the fact that the code is under the GPL; the bit where it says "Thnx to http://www.cs.unc.edu/~gotz/code/affinverse.html" neglects to mention that the original code was completely free. I also have an implementation of this in FPU code, but it's at home :(
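For the rigid special case (rotation plus translation only, orthonormal upper 3x3) the affine trick reduces to a transpose and one small matrix-vector product; a minimal C sketch of that identity, inv([R|t]) = [R^T | -R^T t] (my own code, row-major with the translation in the fourth column; matrices containing scaling need the fuller treatment from the linked pages):

```c
/* Inverse of a rigid-body transform: rotation + translation only,
   row-major, translation in elements 3, 7, 11, bottom row (0,0,0,1). */
static void affine_rigid_inverse(float out[16], const float m[16])
{
    /* The inverse of an orthonormal rotation block is its transpose. */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            out[i*4 + j] = m[j*4 + i];

    /* The new translation is -R^T * t. */
    for (int i = 0; i < 3; i++)
        out[i*4 + 3] = -(out[i*4+0]*m[3] + out[i*4+1]*m[7] + out[i*4+2]*m[11]);

    /* Bottom row stays (0, 0, 0, 1). */
    out[12] = out[13] = out[14] = 0.0f;
    out[15] = 1.0f;
}
```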
Posted on 2004-11-03 08:02:48 by Eóin
Thanks a lot for your replies! :)

@Sparafusile: Very interesting! :) I'll try to put that code in my benchmark program...

@Biterider & Eóin: I'll check it out... :alright:
Posted on 2004-11-04 11:04:02 by QvasiModo
This is what I have so far, tested on Pentium III...
Posted on 2004-11-04 13:57:58 by QvasiModo


Library: Matrix.dll
Procedure: MatrixMultiply
Repeat: 1000000 time(s)
Total time: 174282944 clock ticks, 78 ms.
Each run: 174 clock ticks

Library: pdixon.dll
Procedure: pdixon1
Repeat: 1000000 time(s)
Total time: 3225470300 clock ticks, 1343 ms.
Each run: 4294966227 clock ticks

Library: mlib.dll
Procedure: Mat_Mul_4x4_4x4
Repeat: 1000000 time(s)
Total time: 460865244 clock ticks, 188 ms.
Each run: 461 clock ticks

Library: matasm.dll
Procedure: Mat_Mult
Repeat: 1000000 time(s)
Total time: 169537948 clock ticks, 62 ms.
Each run: 170 clock ticks


I find it interesting that with the pdixon.dll one run took more clocks than the combination of all loops. That's an interesting trick :)
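That per-run figure looks like 32-bit wraparound rather than a timing trick: 4294966227 equals 2^32 - 1069, i.e. a small negative result (for instance, measured cycles minus a too-large loop-overhead estimate) printed as an unsigned value. A hypothetical sketch of the effect:

```c
#include <stdint.h>

/* Hypothetical reconstruction of the odd "Each run" figure: if the
   benchmark's per-run cycle count came out slightly negative, storing
   it in an unsigned 32-bit variable wraps it modulo 2^32, so a value
   of -1069 is displayed as 4294967296 - 1069 = 4294966227. */
static uint32_t as_printed(int32_t per_run_cycles)
{
    return (uint32_t)per_run_cycles;
}
```

With a hypothetical per-run value of -1069, `as_printed(-1069)` reproduces the 4294966227 seen in the log.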

Btw, I'm using a P4.

Spara
Posted on 2004-11-04 15:13:12 by Sparafusile
I find it interesting that with the pdixon.dll one run took more clocks than the combination of all loops. That's an interesting trick :)

Btw, I'm using a P4.

Spara

Oops! :oops:
I'll check that out...
Posted on 2004-11-05 13:57:06 by QvasiModo
Now it's working correctly. :)

Library: Matrix.dll
Procedure: MatrixMultiply
Repeat: 10000000 time(s)
Total time: 2318528161 clock ticks, 3559 ms.
Each run: 232 clock ticks

Library: pdixon.dll
Procedure: pdixon1
Repeat: 10000000 time(s)
Total time: 5348376136 clock ticks, 8211 ms.
Each run: 535 clock ticks

Library: mlib.dll
Procedure: Mat_Mul_4x4_4x4
Repeat: 10000000 time(s)
Total time: 4275011031 clock ticks, 6560 ms.
Each run: 428 clock ticks

Library: matasm.dll
Procedure: Mat_Mult
Repeat: 10000000 time(s)
Total time: 1684109204 clock ticks, 2584 ms.
Each run: 168 clock ticks
Posted on 2004-11-06 21:43:03 by QvasiModo
I must post a matrix routine :)



Here is a matrix multiply routine that multiplies a 4x4 matrix by a 4x1 vector, where both the matrix and vector elements are single precision floating-point numbers. This routine is probably close to optimal on a Pentium.
; 4x4 by 4x1 matrix multiply executes in 58 cycles
;
; input: esi = pointer to original 4x1 vector
; ebx = pointer to 4x4 transformation matrix
; output: edi = pointer to transformed 4x1 vector
; destroys: fp stack

MACRO F4x4M
fld [dword ptr esi] ; x
fld st(0) ; x, x
fmul [dword ptr ebx] ; x*a11, x
fld st(1) ; x, x*a11, x
fmul [dword ptr ebx+16] ; x*a12, x*a11, x
fld st(2) ; x, x*a12, x*a11, x
fmul [dword ptr ebx+32] ; x*a13, x*a12, x*a11, x
fxch st(3) ; x, x*a12, x*a11, x*a13
fmul [dword ptr ebx+48] ; x*a14, x*a12, x*a11, x*a13
fld [dword ptr esi+4] ; y, x*a14, x*a12, x*a11, x*a13
fld st(0) ; y, y, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+4] ; y*a21, y, x*a14, x*a12, x*a11, x*a13
fld st(1) ; y, y*a21, y, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+20] ; y*a22, y*a21, y, x*a14, x*a12, x*a11, x*a13
fld st(2) ; y, y*a22, y*a21, y, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+36] ; y*a23, y*a22, y*a21, y, x*a14, x*a12, x*a11, x*a13
fxch st(3) ; y, y*a22, y*a21, y*a23, x*a14, x*a12, x*a11, x*a13
fmul [dword ptr ebx+52] ; y*a24, y*a22, y*a21, y*a23, x*a14, x*a12, x*a11, x*a13
fxch st(2) ; y*a21, y*a22, y*a24, y*a23, x*a14, x*a12, x*a11, x*a13
faddp st(6),st ; y*a22, y*a24, y*a23, x*a14, x*a12, x*a11+y*a21, x*a13
faddp st(4),st ; y*a24, y*a23, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13
fxch st(1) ; y*a23, y*a24, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13
faddp st(5),st ; y*a24, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld [dword ptr esi+8] ; z, y*a24, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fxch st(1) ; y*a24, z, x*a14, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
faddp st(2),st ; z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld st(0) ; z, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+8] ; z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld st(1) ; z, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+24] ; z*a32, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fld st(2) ; z, z*a32, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+40] ; z*a33, z*a32, z*a31, z, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fxch st(3) ; z, z*a32, z*a31, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fmul [dword ptr ebx+56] ; z*a34, z*a32, z*a31, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
fxch st(2) ; z*a31, z*a32, z*a34, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21, x*a13+y*a23
faddp st(6),st ; z*a32, z*a34, z*a33, x*a14+y*a24, x*a12+y*a22, x*a11+y*a21+z*a31, x*a13+y*a23
faddp st(4),st ; z*a34, z*a33, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23
fxch st(1) ; z*a33, z*a34, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23
faddp st(5),st ; z*a34, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld [dword ptr esi+12] ; w, z*a34, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fxch st(1) ; z*a34, w, x*a14+y*a24, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
faddp st(2),st ; w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld st(0) ; w, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+12] ; w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld st(1) ; w, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+28] ; w*a42, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fld st(2) ; w, w*a42, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+44] ; w*a43, w*a42, w*a41, w, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fxch st(3) ; w, w*a42, w*a41, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fmul [dword ptr ebx+60] ; w*a44, w*a42, w*a41, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
fxch st(2) ; w*a41, w*a42, w*a44, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31, x*a13+y*a23+z*a33
faddp st(6),st ; w*a42, w*a44, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33
faddp st(4),st ; w*a44, w*a43, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33
fxch st(1) ; w*a43, w*a44, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33
faddp st(5),st ; w*a44, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, x*a11+y*a21+z*a31+w*a41, x*a13+y*a23+z*a33+w*a43
fxch st(3) ; x*a11+y*a21+z*a31+w*a41, x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, w*a44, x*a13+y*a23+z*a33+w*a43
fstp [dword ptr edi] ; x*a14+y*a24+z*a34, x*a12+y*a22+z*a32+w*a42, w*a44, x*a13+y*a23+z*a33+w*a43
faddp st(2),st ; x*a12+y*a22+z*a32+w*a42, x*a14+y*a24+z*a34+w*a44, x*a13+y*a23+z*a33+w*a43
fstp [dword ptr edi+4] ; x*a14+y*a24+z*a34+w*a44, x*a13+y*a23+z*a33+w*a43
fxch st(1) ; x*a13+y*a23+z*a33+w*a43, x*a14+y*a24+z*a34+w*a44
fstp [dword ptr edi+8] ; x*a14+y*a24+z*a34+w*a44
fstp [dword ptr edi+12] ;
ENDM
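For readers following the stack comments, here is what the macro computes, restated in C (my reading of the memory layout from the offsets: each output component is the dot product of the vector with four consecutive floats of the matrix, so with the 16-float array taken as row-major this is out = m * v; the a_ij labels in the comments name that same layout column-wise):

```c
/* C restatement of F4x4M: out[i] is the dot product of v with the
   i-th group of four consecutive floats in m (bytes i*16..i*16+12
   in the asm), i.e. out = m * v with m treated as row-major. */
static void mat4_vec4_mul(float out[4], const float m[16], const float v[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = m[i*4+0]*v[0] + m[i*4+1]*v[1]
               + m[i*4+2]*v[2] + m[i*4+3]*v[3];
}
```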
Posted on 2004-11-07 06:21:55 by >Matrix<
I must post a matrix routine :)

Thanks, I'll try it right away :)
Posted on 2004-11-07 11:11:00 by QvasiModo
QvasiModo,

Just a question: are you multiplying a 4x4 matrix by a 4x4 matrix?

BTW, sorry I did not have time to try to code an SSE version.
Posted on 2004-11-07 11:45:36 by roticv
Yes, it's a 4x4 matrix by another 4x4. So I couldn't test >Matrix<'s code, sorry :(

Here's a new version, but it seems I've screwed up the code, because the routines are not all producing the same results. Probably some of these functions use a different representation for the matrix. :?:

I'll look into it tomorrow...
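A quick way to confirm a layout mismatch (a sketch, under the assumption that the disagreeing routines differ only in row-major vs. column-major storage): fed the same input bytes, a column-major multiply must equal the row-major multiply with its operands swapped, because reading row-major data as column-major transposes it, and (A*B)^T = B^T * A^T.

```c
/* Row-major multiply: element (i,j) of x lives at x[i*4 + j]. */
static void mul_rowmajor(float out[16], const float a[16], const float b[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i*4 + k] * b[k*4 + j];
            out[i*4 + j] = s;
        }
}

/* Column-major multiply: element (i,j) of x lives at x[j*4 + i].
   Fed the same byte arrays, its result is byte-for-byte equal to
   mul_rowmajor with the operands swapped. */
static void mul_colmajor(float out[16], const float a[16], const float b[16])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[k*4 + i] * b[j*4 + k];
            out[j*4 + i] = s;
        }
}
```

If two libraries disagree exactly by this operand swap (or, equivalently, by a transpose of the result), the bug is the matrix representation, not the arithmetic.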
Posted on 2004-11-07 21:03:07 by QvasiModo
I found this on amd.com

The optimized matrix routine is on page
Posted on 2004-11-13 13:10:27 by >Matrix<