If anyone is looking for a challenge, I have some source that will eventually be release as a package, but for now its in its "testing" stages. Its a Matrix class object with alot of FPU math mixed up with integer row/col calculations.

I think i got it going pretty smooth, but im sure there is pipeline pairing and other optomization tricks that will make it even better in performance. If your intersted i will set you up with a test package to work with (radasm).

Im most interested in seeing optomization in the following sources (all finished and working ~ just not certified as optomized for speed ):

- Matrix Multiply
- Matrix Add
- LU Factorize
- LU Solve (general method )
- LU SolveVector (specific method)
- RowReduce (gaussian elimination)

They are heavily commented so it should not be a struggle to follow what is going on. I just dont have the time to formally learn the pipeline pairing "rules" and other "tricks". (Took me long enough to devolope these routines).

Lemme know if your interested.
Posted on 2004-03-08 17:32:47 by NaN
Great. Where do I get the source if I want to participate in the challenge?

BTW, just in case you haven't done it yet (but I suspect you already did), take a look at level 3 BLAS and LAPACK. Also, related ACM TOMS articles.
Posted on 2004-03-08 20:40:06 by Starless
Ok, here you go. Please ask questions, as im sure you may have some.

Remember that this is designed for an OOP model, so there is 'wrapper' stuff in the mix. However its only at the beginning and end of the routines im asking someone to look at. They should not be of any real concern. Most of the focus should be at Parts 3.0 -> 5.0 as commented. This is the area with the raw math crunching. Part 0.0->2.0 does checking for matrix types and sizes and sets up for the crazy operations that follows. (look at Matrix_RowReduce/Matrix_LU_Decompose etc.)

There is alot of files as well here. They are all support files chopped out of the package. The can all be ingnored except the Matrix.inc (as this is where the good stuff is ;) ). As well the debugwindow.exe is provided as its designed to work with the OOP package.

One last thing. I found by trial and error that using the stack is far faster than allocating memory via heap/global alloc. As such there temporary memory used for FPU number crunching is allocated directly off the stack in these routines (and pointers saved).

If you have any troubles getting it working let me know.

Thanks alot for looking at this! As always, credits will be given for your help in the final release..
Posted on 2004-03-08 21:08:20 by NaN
Here is my first impression about Matrix.inc.

1. Unroll the critical loops.
From my own experience in the past, I found unrolling plays a great role in numerical computation like matrix algebra. I think unrolling will help you hiding the latency (on P6 and later anyway). Your code in Matrix.inc can be quite faster. Say, for multiplication, I posted dnrm2() long time ago. That may (or may not) be of help in this area.

2. Read TOMS.
I remember an article with tiltle containing RISC BLAS. Basically, the article is about implementing level 3 BLAS. If you care only about the speed, you may benefit from Strassen's algorithm.

I have not looked into LU decomposition yet, and if there is anything that I think may be helpful for speeding up, I'll post again.

BTW, why would you want to create a different inverse routine, when you have LU decomposition and you can use it any time?
Posted on 2004-03-08 23:06:28 by Starless
I had a quick look at your package, always interested with anything having to do with the FPU. There must be a few years of typing inside that package. The keyboard must be worn out!:tongue:

I haven't dealt with matrices for so long that I wouldn't be of much help unless I recycled myself. At this time, functions of complex numbers would have been easier to deal with.

I did however look at your fMath macros and I have a few comments.

For many of your functions, you have a preliminary note such as "st5, st6 and st7 lost". You can't loose the content of such registers unless you free them specificly. You would otherwise produce garbage, and generate an "invalid" exception, if you try to overwrite them if they are not free. And I haven't seen any code to free them.

If, on the other hand, you were to try and use your fUnload macro as written, you would loose the data in st0, st1, st2,... which data could still be valuable.

Some of the macros use the stack to store either the Status Word or the Control Word. I have always been under the impression that a 32-bit assembler will always keep the stack aligned on a dword boundary, i.e. if you push a word, a 0 word would also get pushed with it to make into a dword. If the above is correct, some of your stack references for previously pushed words may be wrong.

As written, the fCheckStatus macro would trash the last value pushed on the stack. That value must have had a specific purpose!!!

I don't know if and when you check for exceptions (primarily for the "Invalid operation" one). If you never clear the exceptions, either specificly with "fclex" or through "finit", any detected exception may not have been generated by the last operation. The exception flags of the Status Word are cumulative.

Looking at the fSgn macro, one assumes that there would never be the possibility of an invalid data in st0 since it does not use the setting of the PF (parity flag) to check whether the comparison was valid or not. The following snippet could also be used for the same purpose without using the FPU comparison and cmovx instructions. The only drawback is if you would ever expect valid REAL8 or REAL10 values on the FPU which would have an absolute value smaller than 10^(-45).

push eax ;reserve stack space
fst dword ptr[esp] ;the sign will always be
;the most significant bit
;regardless of the float size
fwait ;for safety
pop eax ;retrieve the float
shl eax,1 ;transfer the sign bit to the C flag
jz @F ;takes care of both -0 and +0
mov eax,0 ;does not affect current flags
sbb eax,0 ;-> -1 if st0 negative
or eax,1 ;-> +1 if st0 positive but doesn't
;change the -1 if it was negative
push eax ;store result on stack
ffree st(7) ;again for safety
fild dword ptr[esp] ;load it to the FPU
fwait ;for safety
pop eax ;clean stack
Posted on 2004-03-09 21:30:49 by Raymond
Heh, I didnt say it would be a walk in the park... only that i comment the crap out of sections i worked on. BiteRider did alot of the other work, including the Inverse routine. I just stepped up to continue what i saw as a good start.

As for the fMath marcros i will look in to this. You may have noticed im not useing them in my routines (LU stuff and RowReduce), so I havent really looked to closely at how they are written. But i assure you your point is not being overlooked here. The full package is alot bigger than what i have provided. I only cut & pasted what i thought would be required to get it to assemble for those who think it can help. As such its a quick "patch" job with it all lumped in one directory. The full package of objects is multiple directorys, examples, help files, etc. So again, dont tear into this background support files too much . For now im hoping to see the matrix methods sped up a few hundred clocks on a standard 3x3 matrix.

I was kicking around the idea of developing an Electrical Circuit analysis tool as an example using the model and this Matrix class object as its core "engine" for calculations. This is simple in theory to do once you have a good matrix tool to work with, but they will get big (ruffly for a N element circuit ~ you need a N.N matrix. If your using Capacitors / Inductors / Silicon devices it will become 2*N*N. So speed counts when the number of elements are increased.

I thank you for your interest, and I dont need it solved right away.. Take some time on it.. As it was pointed out ~ There is alot of code to look at ;)

I havent heard any comments on the mov/shl/imul/add chunks of code that is used over and over to set up Row/Column pointers/offsets. So i will assume this is probably the best choice to do such. (Neglecting pairing rules??)

As always, thanks again!
Posted on 2004-03-09 22:40:29 by NaN
For now im hoping to see the matrix methods sped up a few hundred clocks on a standard 3x3 matrix.

Really? For such a tiny matrix, the current code is just fine. I thought you would deal with something large. I usually deal with matrices of thousands by tens, and my thinking is biased towards such matrices.

So speed counts when the number of elements are increased.

And if you can get away with less accurate results, Strassen's algorithm is the way to go.

BTW, what is your target CPU again? When it comes to FPU, P5, P6, and Netburst cores are all different from each other. You cannot optimize for all. You know it very well. It seems to me that you target for P5, mentioning pairing rules. Right?
Posted on 2004-03-10 00:02:56 by Starless
Hello Raymond
Thanks for your comments! The intention of the fMath macros is to give to the programmer small code snippets to use universally but without checking of any sort. The programmer is responsible to do it.

I would try to clarify some comments:

1. When I write st5, st6, st7 lost, I mean that due to some FPU operations, the content of these registers, if you have used them, are lost.
2. The fUnload macro is intended to unroll the stack n positions if the first n values are no longer needed. Some demos are included in the package.
3. When I push some word values on the stack, it is done under some alignment and if I don?t change it, I?ll get the correct value when I pop it from the stack.
4. The fCheckStatus macro uses the stack to store transitory a value that is no longer needed after the execution macro. In this case, the FPU status word.
5. The detection of Exceptions lies on the programmer?s hand.
6. fsgn: you are right, the PF is not checked for an valid comparison. It is assumed, that the value in st0 is a valid FP number.


Posted on 2004-03-10 00:30:48 by Biterider
Biterider wrote:
The intention of the fMath macros is to give to the programmer small code snippets to use universally but without checking of any sort.
Code snippets are great as long as their limitations are clearly identified along with it (and not in separate documents). Otherwise, they are bound to be misused more often than not.

You may have misunderstood my comment related to the fCheckStatus macro. If used as written, the temporarily stored Status Word would still overwrite the last value pushed on the stack!!!! That one may NOT be a value that is no longer needed after the execution macro.

I still maintain that my comment about the st5, st6, st7 lost is correct. Those listed registers MUST BE EMPTY in order to use them. If they are EMPTY, they don't contain any data. Therefore, you cannot loose data from EMPTY registers. Going back to my above comment, the clarifying note should have been: st5, st6, st7 must be EMPTY.

Posted on 2004-03-10 11:56:06 by Raymond
Hello Raymond
Yesterday, I was playing around with the FPU the check your points and I must admit that you are right!
The indicated registers in the notes should be EMPTY. Concerning the stack problem, it is certainly a bug that is also in the fSetPrecision macro. I corrected these points in the new fMath.inc 1.0.2 file included in the attachment. Thanks! :alright:
If you have more corrections, please tell it to me. Thanks again...

Posted on 2004-03-11 00:43:38 by Biterider
Glad to be of help. I'll give your revised version a more detailed look.

Posted on 2004-03-11 10:49:07 by Raymond
My comments on the fmath.inc file Version: 1.0.2 are attached as a zipped text file.

Posted on 2004-03-11 20:10:56 by Raymond
Hello Raymond
I checked all your points and corrected the code.
Concerning the peculiar use of the stack, I don?t think that Windows overrides the stored values during an interrupt since it executes in kernel mode on a different stack. It is more possible that it would happen under DOS, but I?m absolutely sure, that some debuggers (MS) modify this memory place. Considering this, your point is absolutely correct!
About fExp2, I check the performance of my old macro and yours and found that although your macro has less FPU instructions, it is slower. At the moment I can not explain it (?)
I also check the performance of the fArcCsc macros. The result shows that your first macro is the fastest, but the last the slowest. The differences are neglectable.
Thanks for your support! :alright:

Posted on 2004-03-13 11:30:45 by Biterider