Hello everyone, this is a 'little' macro package that allows you to work with m x n matrices.

I actually conceived this as a replacement for MatLab.... :grin: also I have to do a research project on DSP and this is intended to be part of it - I am more comfortable programming in ASM than in MatLab he he he....

But anyway... it is far from finished... The "manual" (cough, cough) is rather tough to read as yet, but you should get the general gist of how the macros (should) work.

Also... you can try reading the source code...

But anyway:
Include the file matasm.inc. If you need functions, include matfunc.inc. Incidentally, if you wish, you can load matrices using LoadMatrixFromFunction... Also, you prolly want to print out the matrices, so include vkprint.inc, then matprint.inc. Including vkprint.inc makes the @PrintMatrix macro print to vkim's debug window. Later on I plan to have my own output window just for MatAsm, and possibly a matrix graphing program too.

Incidentally I am using an older version of vkim's debug window so you should prolly reassemble the example file matasmbg.asm

MATRICES are memory allocated using GlobalAlloc as GMEM_FIXED. This is done at init time. You MUST include the include file 'matasmin.inc' AFTER all MatAsm macro code. Note that you should NOT define the matrices, you should just use any name in the MatAsm code - the MatAsm code will define them for you. At init time, call MatAsmInitialization, then at de-init time call MatAsmUnInitialization. If you want to track where errors occur, use the @Marker macro. You just place @Marker ## and the MatAsm commands will show that marker when they encounter an error.

The matrices have the following format:
offset 0 = number of rows
offset 4 = number of columns
offset 8 = the data. Each element is a 4-byte FLOAT data piece.
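
For readers who think in C, here is a minimal sketch of that layout. The struct name, the helper names, and the row-major element order are assumptions for illustration; the post only fixes the three offsets and the 4-byte float element size.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical C view of the layout described above:
   offset 0 = rows, offset 4 = columns, offset 8 = 4-byte float data. */
typedef struct {
    uint32_t rows;
    uint32_t cols;
    float    data[];   /* rows * cols elements, ASSUMED row-major */
} Matrix;

/* Allocate a zero-filled rows x cols matrix; returns NULL on failure. */
static Matrix *matrix_new(uint32_t rows, uint32_t cols)
{
    Matrix *m = calloc(1, sizeof(Matrix) + (size_t)rows * cols * sizeof(float));
    if (m != NULL) {
        m->rows = rows;
        m->cols = cols;
    }
    return m;
}

/* Pointer to element (r, c), both zero-based. */
static float *matrix_at(Matrix *m, uint32_t r, uint32_t c)
{
    assert(r < m->rows && c < m->cols);
    return &m->data[(size_t)r * m->cols + c];
}
```

With this layout a 3 x 4 matrix occupies 8 + 3*4*4 = 56 bytes, and the block can be released with a plain free().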

To initialize an m x n matrix:
ex:
;Matrix1=
;1 2 3 4
;5 6 7 8
;4 3 2 1
LoadRowMatrix Matrix1, 1.0, 2.0f, 3.0f, 4.0f
ConcatenateByRow Matrix1, $RowMatrix (5.0f, 6.0f, 7.0f, 8.0f)
ConcatenateByRow Matrix1, $RowMatrix (4.0f, 3.0f, 2.0f, 1.0f)

Looping instruction:
ForMatrix <MatrixToLoop>, (<OfWhatMatrix>)
....
EndForMatrix

MatrixToLoop takes on each part of OfWhatMatrix in turn, one per pass through the loop. So if you do:
ForMatrix Temp,ROWS(Matrix1) ;assuming Matrix1 is initialized above
@PrintMatrix Temp
EndForMatrix

the program will output each row in Matrix1.

NOTE! Empty (null) matrices may make certain commands fail or crash; this is still under development.

Also, it seems that MASM, in and of itself, has trouble with things like -1.0f. You should just put -1.0 (without the f), but of course then you gotta remember the decimal point...

Posted on 2003-03-25 04:13:28 by AmkG
Updated version: fixed bugs with the @PrintMatrix macro, added TransposeMatrix.

Documentation still sucks terrible.

I wonder if anyone is interested...?

Well I'd still need it for my research project (maybe).

Posted on 2003-03-26 02:14:02 by AmkG
Matlab replacement. Impressive. :alright:

I don't know the design decision you made, so my questions may be totally stupid. But, I want to hear about your design decision.

1. Why do you use fsincos when you only need sin or cos? And, how do you make sure that the input value is within the accepted range of fsincos?

2. Why do you store st(0) to memory when you just want to discard st(0)? This can be done by fstp st(0), which is shorter and faster. Likewise, why do you fxch/fstp mem when you can make things shorter and faster by fstp st(1)?

3. What is the purpose of Function_RAMP? It seems to me that this function is not implemented fully. It just multiplies two args. There's got to be something more than that to justify it as a function.

4. Why do you ftst when you can make it faster using integer instructions in Function_UNITSTEP? Agner's note gives an example about this. That is, your code can be written as

mov eax,time
add eax,eax ; mov does not set flags; this shifts out the sign bit, so ZF=1 for +0.0 and -0.0
jz SetReturn
...

Of course, you can completely eliminate Jcc with a little bit of effort.
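
As a rough illustration of the branch-free idea, here is a C sketch that decides the step purely from the integer bit pattern of the float. It assumes IEEE-754 single precision and assumes the step is 1 for t >= 0 (including -0.0) and 0 otherwise; the thread does not say what Function_UNITSTEP returns at exactly zero, and the name unit_step is made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Unit step without floating-point compares: 1.0f for t >= 0, else 0.0f.
   Assumes IEEE-754 single precision; NaN inputs are not handled specially. */
static float unit_step(float t)
{
    uint32_t bits;
    assert(sizeof bits == sizeof t);
    memcpy(&bits, &t, sizeof bits);            /* reinterpret the float as an integer */

    uint32_t sign_clear = (bits >> 31) ^ 1u;   /* 1 when the sign bit is 0 */
    uint32_t is_zero    = ((bits << 1) == 0);  /* 1 for +0.0 and -0.0 */
    uint32_t mask       = 0u - (sign_clear | is_zero); /* all ones or all zeros */

    uint32_t out = mask & 0x3F800000u;         /* 0x3F800000 is the bit pattern of 1.0f */
    float result;
    memcpy(&result, &out, sizeof result);
    return result;
}
```

Whether a compiler turns the comparison into a setcc or a jump is up to the compiler; in hand-written assembly the same mask trick needs no Jcc at all.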

5. Why do you use GlobalAlloc() and friends when SDK documents discourage the use of them?

6. Some macros are a bit large to be implemented as macros. And it would be nice to clarify the register usage, especially when a macro destroys edi, esi, and/or ebx.
Posted on 2003-03-26 17:17:17 by Starless
Starless,

It's absolutely naive code. :alright: :rolleyes: ;)

I have no idea on the limits of FSINCOS... But it works so far with limits of 0 to 4*pi.

And IIRC (not that I've programmed with the FPU since the 80287) the FSIN and FCOS (do they even exist??? - can't remember) instructions are more limited in their accepted input.

And yeah I guess some macros are better off as wrappers around external routines... I'll look into that...

I didn't know about fstp st(1)... dang, that's possible????

The 'RAMP' function is a 'ramp'. Actually I forgot about negative values of time: it should be zero there. At time > 0, a ramp function is simply a line that starts at y=0, x (or time)=0. So in its ARGS(m), m is the SLOPE of the function. Since a line is y=mx + b, and b=0 (since at x=0, y=b=0), RAMP is simply y=mx, a simple multiplication of two numbers: a 'constant' m, provided by the programmer, and x, the value our LoadMatrixFromFunction macro changes.

So basically RAMP should be 0 at t<0, and should be equal to t times its slope at t>=0.

Also... rather unfortunately, you can be reasonably sure that all macros trash esi. Most also trash edi, and some trash eax, ecx, and edx. ebx is not often used but TransposeMatrix trashes it too. Prolly should include in the docs what macros trash what registers. But anyway, the problem is that I MUST use GlobalReAlloc to change the dimensions of the matrices, so I sometimes need esi, edi, and ebx, and of course GlobalReAlloc trashes all other registers....

Thanks for taking the time to look and see...

P.S. As for that register trashing thing.... I completely forgot about it and originally wrote TransposeMatrix to use esi and ebx in the ForMatrix loop.... unfortunately ForMatrix trashes esi... so I had a bug... but it's been fixed, he he he. I prolly really SHOULD document what macros don't trash what registers....

P.P.S.
BTW I thought up a way of doing DSP correlation using matrices...
Say we have two row matrices representing our signals.
We transpose one of these matrices and multiply them (matrically, of course, using matrix rules, not the multiply by element...). The middle downward ( \ ) diagonal of the product matrix, when summed, will equal the correlation with no shifting. The diagonal below it will be equivalent to shifting the transposed matrix left by one, while the diagonal above the middle diagonal will be equivalent to shifting the transposed matrix right by one. I wonder if this is good or if it is better to write my own ASM routine to perform correlation.
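
To check the diagonal idea on paper, here is a small C sketch. It assumes both signals are row vectors of the same length n, builds the full n x n product P = x^T * y, and sums the diagonal j - i = k; the second function computes the same sum directly. Function names are invented for the example, and which sign of k counts as a "left" or "right" shift is a matter of convention that the sketch does not settle.

```c
#include <assert.h>
#include <stdlib.h>

/* Sum of the diagonal j - i = k of the n x n product P = x^T * y,
   where x and y are row vectors of length n, so P[i][j] = x[i] * y[j].
   k = 0 is the main diagonal; returns 0 if allocation fails. */
static float diagonal_sum(const float *x, const float *y, int n, int k)
{
    assert(n > 0 && k > -n && k < n);
    float *p = malloc((size_t)n * n * sizeof *p);
    if (p == NULL)
        return 0.0f;

    for (int i = 0; i < n; i++)          /* the full outer product: n*n multiplies */
        for (int j = 0; j < n; j++)
            p[i * n + j] = x[i] * y[j];

    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        int j = i + k;
        if (j >= 0 && j < n)
            sum += p[i * n + j];
    }
    free(p);
    return sum;
}

/* The same quantity computed directly: sum over i of x[i] * y[i + k]. */
static float correlate_at_lag(const float *x, const float *y, int n, int k)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        int j = i + k;
        if (j >= 0 && j < n)
            sum += x[i] * y[j];
    }
    return sum;
}
```

Both give the same number for every k, but the matrix route spends n*n multiplies per call where the direct sum needs at most n.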
Posted on 2003-03-26 18:52:41 by AmkG
What are the things GlobalReAlloc() can do that HeapReAlloc() cannot do? I'm totally in the dark about the merit of Global*(). I use it only when APIs require them, e.g. clipboard operation. So, I guess you could tell me a little bit more about the merit of Global*() over Heap*().

So, you basically want tr(X'Y) for the correlation. I don't know what you mean by 'row matrix'. Did you mean 'row vector'? If so, the answer is trivial. Your 'non-shifted' version is simply $\sum_i (x_i * y_i)$ (Darn, I wish web pages could render TeX markup.) Shifted versions can be obtained just by shifting the starting points. -- When in doubt, write down the matrix X'Y, and confirm it.

If you meant general m by n matrices but 'row' got in the way inadvertently...
You know, the matrix-matrix multiplication as in calculus books is O(n^3) procedure. So, if you can avoid it, by all means, do so. (Of course, if you don't mind the accuracy, you could go for Strassen algorithm.)
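
For reference, the textbook procedure looks like this in C. It is a generic sketch, not MatAsm's MultiplyMatrices: the function name and the row-major storage are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* c (m x o) = a (m x n) * b (n x o), all row-major arrays of float.
   Three nested loops: m * o * n multiply-adds, i.e. O(n^3) for square inputs. */
static void mat_mul(const float *a, const float *b, float *c,
                    size_t m, size_t n, size_t o)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < o; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * o + j];
            c[i * o + j] = sum;   /* every write stays inside the m x o output */
        }
    }
}
```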
Posted on 2003-03-27 19:25:41 by Starless
Great work! Keep up the effort...

I would lean towards Heap objects personally (as long as they are under 4k each). But this is a lesser point... still good work!

I look forward to your future revisions ;)
:alright:
NaN
Posted on 2003-03-27 21:38:09 by NaN
Well I've been currently having problems with the darned Global** functions: see http://www.asmcommunity.net/board/index.php?topic=11868.

Perhaps the only merit they have is simplicity - I needed a simple way of allocating memory with a granularity of 4 bytes, without worrying too much about things like virtual addresses or heaps. Currently, with the problems I am encountering with the Global* things, I am considering writing wrapper procedures (MatAlloc etc.) to replace Global* with either Heap* or Virtual* functions. I hope that the problems I encounter in the things above don't come up again when using Heap or Virtual. Aaargh.

My other problem is that potentially matrices can be very large so I *might* exceed 4K.... especially since I intend to experiment with wavelets, whose Continuous transforms tend to contain lotsa redundant info. (although of course when using Discrete wavelet transforms my transformed data will be no larger than the original time-domain data, but of course I want to start with Continuous wavelet transforms).

Starless,
yeah row matrix = row vector. Well I would assume that a 'row matrix' = row vector; if you look at MatAsm itself, you see I say 'LoadRowMatrix' and '$RowMatrix', not vector ha ha ha....

As for 'normal' matrix multiplication of an m x n and n x o matrix, I believe I have the algorithm correct, just having problems with Global* allocation things as per in the above linked thread.

NaN and Starless,
Thanks for the support. And don't worry, you guys prolly have had more time to program/study Win32 assembly than I have.

Sincerely,
AmkG
Posted on 2003-03-27 23:13:38 by AmkG
Another nice feature would be to save and load your matrices (in a compressed binary format). No need to waste disk space for 1000 zeros in a large matrix ;)

:NaN:
Posted on 2003-03-27 23:18:41 by NaN

Yes I will have to implement something like that for my intended matrix plotter: Since the matrix plotter is intended to be a separate application (just as the VkDebug window is a separate application), I need a way of sending lotsa data to another app. The easiest is to save the matrix as a file, and run the matrix plotter application as an app with the matrix file as an argument. Prolly the matrix file should also include the name of the displayed matrix. Perhaps the compressed form will have to come later on.

P.S.
I have tried using wrapper functions for Heap* of my own to try and solve the problem I mentioned in relation to memory allocation. Mat* functions use Heap* functions (see extproc\matalloc.asm). Unfortunately the exact same problem exists, it crashes at just about the exact same point (at freeing time - at all times, why freeing time???????), what do you guys think I should do? Go down to bare-metal Virtual*? Maybe get each MatAlloc a new heap to work with? Jump off the top of a building? I can't get past MultiplyMatrices with this, aaaaaarrrrgggh why does memory allocation suck so much.

Maybe I should just ditch MultiplyMatrices and go on with writing my own CorrelateMatrices macro... but then MatAsm would not be 'complete'.
Posted on 2003-03-27 23:55:35 by AmkG
Well here is the latest version, the demo program matasmbg.asm still crashes (at UNINITIALIZATION time, of all times!!!) with the MultiplyMatrices thing.

Prolly with this type of memory allocation scheme I can assume that eventually as more and more complex matrix operations are performed, maybe the thing will crash at REALLOC time, not just freeing time...

Unless of course I just HeapDestroy without doing HeapFree.... Would this help I wonder?
Posted on 2003-03-28 01:07:44 by AmkG
Wild guess about crashing: (Well, sort of experience-based :) )

If you see crashing at *Free() time, that means *Free() somehow receives an invalid pointer (most likely NULL). This suggests that you might have made some mistakes in bookkeeping all *ReAlloc()'ed mem blocks.

IIRC, you can HeapDestroy() without HeapFree() and that is perfectly fine per SDK documentation. And, that will solve the crash problem. But, if you actually made mistakes in tracking mem blocks HeapReAlloc()'ed, then that suggests there might be memory leak. If the matrix size gets large, this may pose a problem at run time (ENOMEM).

BTW, what about removing old versions from the previous postings? It seems that people don't bother to go down to look for a bug-fixed/feature-extended version.
Posted on 2003-03-28 02:19:16 by Starless
I gave a quick look at your latest version.

1) Your HeapCreate flag "0" doesn't exist as an option:
HEAP_NO_SERIALIZE equ 00000001h

HEAP_GROWABLE equ 00000002h
HEAP_GENERATE_EXCEPTIONS equ 00000004h
HEAP_ZERO_MEMORY equ 00000008h
HEAP_REALLOC_IN_PLACE_ONLY equ 00000010h
HEAP_TAIL_CHECKING_ENABLED equ 00000020h
HEAP_FREE_CHECKING_ENABLED equ 00000040h
HEAP_DISABLE_COALESCE_ON_FREE equ 00000080h
HEAP_CREATE_ALIGN_16 equ 00010000h
HEAP_CREATE_ENABLE_TRACING equ 00020000h
HEAP_MAXIMUM_TAG equ 0FFFh
HEAP_PSEUDO_TAG_FLAG equ 8000h
HEAP_TAG_SHIFT equ 18

In your MATASMIN.INC file you have some odd initialization routines.. Why you're wrapping procs in a macro I don't know... You're most likely having an error around here properly storing your returned heap pointers.... I dunno, haven't fully studied your work..

:NaN:
Posted on 2003-03-28 10:05:37 by NaN

I didn't know HeapCreate had THOSE options... The documentation in msdn said only HEAP_GENERATE_EXCEPTIONS and HEAP_ZERO_MEMORY were available for HeapCreate... Since I didn't really need to have it zeroed I left it as 0.

I'll look into this.

My mistake... HeapCreate has only HEAP_NO_SERIALIZE and HEAP_GENERATE_EXCEPTIONS according to this site: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/memory/base/heapcreate.asp I don't wanna fool around with the rest...

Originally posted by NaN
In your MATASMIN.INC file you have some odd initialization routines.. Why you're wrapping procs in a macro I don't know... You're most likely having an error around here properly storing your returned heap pointers.... I dunno, haven't fully studied your work..

There is a reason for that:
The package has to work with a variable number of matrices, of varying sizes. The 'varying sizes' is provided by the memory allocation things, Heap/GlobalReAlloc, etc. For the 'variable number' of matrices, however, I encountered a problem.

For one thing, the target audience of the package is those who are not really interested in assembly (okay, so no one who is not interested in assembly would be interested in this anyway, but I always treat users that way, you know?). So the user will not want to have to do:

.data?
Matrix1 dd ?
.code
InitializeMatrix Matrix1

In fact, I wanted to make sure that the programmer can use any variable name for the program, kinda like in BASIC (old-style BASIC at least). To do this, the package maintains a comma-separated list of matrices, _ListOfMatrices_.

Unfortunately, in my experimentation, I found that:
FOR z_z,<%_ListOfMatrices_>
...
ENDM

Will NOT go through the list of matrices. It will assign the WHOLE of _ListOfMatrices_ in z_z, commas and all.

What DID work, was to wrap the thing in a macro, call it in FOR z_z,<%_ListOfMatrices_>, then, inside the macro, do another FOR z_z,<Entry>, which does loop through all the matrices in the list. That is the only reason it's wrapped in a macro. The macro won't work with %_ListOfMatrices_ directly, you really do need it in a loop. What can I say, it's a work-around.

Anyway I don't think I'll have much time, our advisor told me that MatLab is still better since all the tools I need were already there...:rolleyes: well what's the point of research if I don't get to build my own tools??? Well I STILL want to work with MatAsm... aaww.... :mad: :tongue: :alright: But of course my research takes priority.

Starless,
Unfortunately I can't seem to remove old attachments....

I wonder if a kind moderator can do this for me?

Sincerely,
AmkG

OKAY, OKAY, a kind moderator DID do this for me already. Thanks lots bitRAKE

P.S.
Starless,

I've added some checks into my MatAlloc set to put up some message boxes in case a Heap* routine returns 0. They don't trigger, ergo, they don't return a null pointer at any time.

Someone has suggested that my _Mat_Mult_ routine is clobbering memory outside the matrix; it is possible that it accidentally clobbers some system-maintained data structures... Unfortunately I don't have a debugger so I cannot really track my _Mat_Mult_ routine... I will have to rely on VkDebug, and for some reason, even VKDebug can sometimes fail with its GlobalAlloc when it gets this program. Damn weird thing.
Posted on 2003-03-30 20:41:11 by AmkG
It's not my _MatMult_ routine.

It seems to limit its clobbering within the limits of the output matrix - at least, according to outputting of VK's PrintHex, it limits its edi within the allocated space of the output matrix.

The MatReAlloc etc. routines also now check for a zero return, and are supposed to pop up some message box saying MATRIX ALLOCATION ERROR but the message box never pops up. And yeah I know I'm doing NULL as hWnd to the MessageBox, it's perfectly legal and I've always used it for such programs that don't have their own window. Unless that really screws up when you just called a Heap* routine.

Anyway I've taken out the MatFree calls at UnInitialization time, and just keep my fingers crossed that the problem doesn't reach ReAlloc time.

Sincerely,
AmkG

ps. I'm now working on signal Correlation, so that there is an *incomplete* matcorrl.asm in the extproc directory. Wish me luck.
Posted on 2003-03-31 00:57:35 by AmkG
if you don't have a debugger, go fetch one. OllyDbg is small and free, so... go go go! ;)

If you're having crashes at *Free() time without null pointers, then yes, you're very likely clobbering memory somewhere. Using VirtualAlloc would probably "fix" this, as all allocations are done in 4k increments, and you probably won't use that much memory, but... that wouldn't really be a fix. Find your bug :)

Using your own Mat* wrappers around the "real" allocation routines is a pretty good idea (if not doing it this way already, you might want to do it as macros, though). Lets you easily play around with other allocation methods, like IMalloc, custom allocation routines, etc.
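
One thing such a wrapper layer makes easy is catching writes past the end of a block. The C sketch below is a hypothetical illustration, not the MatAlloc code from the package: it sits on top of malloc/free, stores the requested size in front of each block and a guard word behind it, and reports whether the guard survived. All names and the guard value are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAT_GUARD 0xDEADBEEFu   /* arbitrary marker value */

/* Layout of each block: [size_t size][user bytes ...][uint32_t guard] */

/* Allocate `size` user bytes with a guard word after them; NULL on failure. */
static void *mat_alloc(size_t size)
{
    uint32_t guard = MAT_GUARD;
    unsigned char *raw = malloc(sizeof(size_t) + size + sizeof guard);
    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof size);
    memcpy(raw + sizeof(size_t) + size, &guard, sizeof guard);
    return raw + sizeof(size_t);
}

/* Returns 1 if the guard word after the block is intact, 0 if it was overwritten. */
static int mat_check(const void *ptr)
{
    const unsigned char *user = ptr;
    size_t size;
    uint32_t guard;
    assert(ptr != NULL);
    memcpy(&size, user - sizeof(size_t), sizeof size);
    memcpy(&guard, user + size, sizeof guard);
    return guard == MAT_GUARD;
}

/* Free a block from mat_alloc; returns the mat_check result taken just before freeing. */
static int mat_free(void *ptr)
{
    int ok = mat_check(ptr);
    free((unsigned char *)ptr - sizeof(size_t));
    return ok;
}
```

Calling mat_check after each matrix operation narrows down which routine stepped outside its output matrix, without needing a debugger. It only catches overruns that hit the guard word, so a clean result is not proof of innocence.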

HeapAlloc is probably the best generic allocator. VirtualAlloc has quite some overhead and can't easily ReAlloc, so it's best for one-time large-memory allocation, or where you need the control that VirtualAlloc gives. Global/Local* are deprecated. I wrote some stuff about it all a while ago, at http://f0dder.has.it , "Memory allocation ramblings". There's probably a few inaccuracies in there, and from a quick skim through it, it would appear I have made a bunch of updates to the document that haven't made it from my harddrive to the internet - guess it's time to update my website soon. In particular, I have stated that "GlobalAlloc returns a direct pointer on win32" - this might not hold true if you use the "GMEM_MOVEABLE" flag. Default is GMEM_FIXED though. Well, before I ramble too much here, I guess I should have a look at the mess at home and actually update the article. It might very well be that you have to use Global* with clipboard, DDE, (etc) as PSDK says, as even though they internally use Heap* on NT, special flags are passed along.

My bottom line, I guess: use Heap* for generic stuff, other routines where you have to, and look into implementing your own memory allocators. James Thorpe, http://www.thorpeweb.com/, has written some stuff that has a nice speed increase compared to standard (win32, libc, whatever) allocation routines. For specific needs, you could write even more specialized routines, and go even faster.
Posted on 2003-03-31 01:14:56 by f0dder
Originally posted by f0dder
James Thorpe, http://www.thorpeweb.com/ has written some stuff that has a nice speed increase compared to standard (win32, libc, whatever) allocation routines. For specific needs, you could write even more specialized routines, and go even faster.

I'm just curious. The source code looks like old MSC (or early days of VC) malloc() and friends, esp., in Windows 3.1 days. If you remember it, (I'm not sure if MS did the same thing with the newer VC products. I don't know just because I don't have a copy.) MS created two versions of memory management: near heap version and far heap version. IIRC, the far heap was used to allocate a big chunk and the near heap was used to manage small blocks. And the near heap management was quite similar to the linked code.

Now I wonder where the (claimed) speed gain in the linked code comes from. Maybe from a faster implementation of MT lock? I don't know. It does not seem to me that the gain comes from the pure memory allocation management side, because the code does not call mem mgmt APIs directly, but wraps malloc() and friends in libc (whatever the target compiler may be).

Then, what is the source of the claimed speed gain? :confused:
Posted on 2003-03-31 03:08:11 by Starless
I would assume part of it is from his own MT locking.

Why shouldn't wrapping malloc() be faster than using "native" mmgr calls anyway? It's possible to implement smarter (well, depending which libc you use) allocation schemes ontop of malloc, like quicker free node lookups, (yadda yadda). I haven't looked very much into how thorpe does his stuff, I've only played around with simple stuff like substituting malloc with heapalloc and such. His test app does show a ~3x improvement over standard libc malloc() with vs.net. I haven't tested it against raw Heap*, that might prove interesting too.
Posted on 2003-03-31 03:19:17 by f0dder
Originally posted by f0dder
Why shouldn't wrapping malloc() be faster than using "native" mmgr calls anyway?

Because APIs will be called in the end? -- This was the MS's argument when soliciting Unix developers to NT platform.

Anyway, if MS did not change the internal working of malloc() since VC4, then another reason is "double admin overhead". (I just coined this.) What I mean is that the similar memory block search routines are used -- well, libc one is not likely to be used, but there is one Jcc anyway. And another call/ret overhead. So, at best, wrapping malloc() can be on par with libc malloc(), but I don't think the wrapper can outperform it without outside help like another MT lock.

Aside, his MT lock does not look MT-safe. Does cmpxchg come with implicit lock prefix? I don't remember reading something like that in Intel's document. I do remember that cmpxchg can be used with lock.
Posted on 2003-03-31 03:33:02 by Starless
additional call/ret overhead should be minimal compared to the rest of the work that has to be done - but of course would be nice to eliminate. "double admin overhead" could be considerably worse, and I would personally base a custom heap manager on something more lowlevel than libc malloc - and it was easy changing thorpe's xmalloc to do this. It didn't seem to cause any performance difference switching libc malloc to raw HeapAlloc though. Even lower level (VirtualAlloc) might be different, but it would require a whole lot more coding to make it work.

Stuff also depends on the goal you're trying to achieve. A fast generic allocator? Or perhaps you want to avoid heap fragmentation? Perhaps you have very specific demands, like tons of small allocations, few large, constantly freeing and reallocating, etc.

Originally posted by Starless
So, at best, wrapping malloc() can be on par with libc malloc(), but I don't think the wrapper can outperform it without outside help like another MT lock.

Really depends on how libc does it and what optimizations you're doing. I haven't looked much at thorpe's version, and I haven't looked into ms libc implementation at all - I guess I should dig into it, might prove interesting. But my idea here is that if you're only falling back to libc malloc to allocate "a few large chunks" of which you do your own management, of course you should be able to achieve better speeds with a specialized implementation. And again, I'd prefer using another "low-level allocator" - probably Heap* on win32.

As for cmpxchg, I think it's safe for UP MT handling, but I would have to look at intel docs to say anything about MP.
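
On the lock question: on x86, cmpxchg is only atomic across processors when it carries an explicit lock prefix (xchg with a memory operand is the instruction that locks implicitly). A portable way to get the locked behaviour without writing the prefix by hand is a C11 atomic compare-exchange, which compilers for x86 typically turn into lock cmpxchg. The sketch below is a generic illustration with invented names; it is not Thorpe's lock.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct {
    atomic_int held;   /* 0 = free, 1 = taken */
} SpinLock;

/* Try once to take the lock; returns 1 on success, 0 if already held. */
static int spin_try_lock(SpinLock *l)
{
    int expected = 0;
    /* Atomic compare-and-swap: only succeeds if held was 0. */
    return atomic_compare_exchange_strong(&l->held, &expected, 1);
}

static void spin_lock(SpinLock *l)
{
    while (!spin_try_lock(l))
        ;   /* busy-wait; a real lock would pause or yield here */
}

static void spin_unlock(SpinLock *l)
{
    assert(atomic_load(&l->held) == 1);
    atomic_store(&l->held, 0);
}
```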
Posted on 2003-03-31 04:31:59 by f0dder
Originally posted by f0dder
I would personally base a custom heap manager on something more lowlevel than libc malloc

Exactly my point. :)

It didn't seem to cause any performance difference switching libc malloc to raw HeapAlloc though.

Hmm, this is interesting. But, again, I don't know how MS (not to mention other C compiler vendors who don't make libc source code available) implements malloc() in more recent versions. Maybe there is something fundamentally different in Thorpe's code. I should take a closer look.
Posted on 2003-03-31 05:06:50 by Starless