Hi, I'm trying to write a DLL with MASM that exports functions using SIMD instructions (the XMM registers). The functions are meant for vector and matrix calculations, because speeding those up would be good for 3D stuff.

Pieces of the asm code come from the book "Real-time rendering and software technology" by Alan Watt and Fabio Policarpo.

The following code is the DLL code that should export a matrix-multiplication function.
I'm trying to use the DLL from Delphi, where I've declared a matrix as an array of 16 "Singles", which means the array occupies 16 times 4 bytes.
The returned matrix is declared in the .data segment as m2, also occupying 16 times 4 bytes (m2 dd 16 dup (0.0)). "ths" and "m1" are the matrices to be multiplied.

My problem is that Delphi raises an exception, and I guess I've done something wrong in my DLL... I hope you can help me find my error.

.586

.xmm
.model flat, stdcall
option casemap: none

.data

m2 dd 16 dup (0.0)

.code

MultMatrix proc ths: DWORD, m1: DWORD ; these should be the addresses
; of the passed matrices
mov edi, m1

movaps xmm4, [edi]
movaps xmm5, [edi+16]
movaps xmm6, [edi+32]
movaps xmm7, [edi+48]

mov esi, ths
mov eax, 0

L1:
movaps xmm0, [esi+eax]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0

shufps xmm0, xmm2, 000h
shufps xmm1, xmm2, 055h
shufps xmm2, xmm2, 0AAh
shufps xmm3, xmm3, 0FFh

mulps xmm0, [edi]
mulps xmm1, [edi+16]
mulps xmm2, [edi+32]
mulps xmm3, [edi+48]

addps xmm0, xmm1
addps xmm0, xmm2
addps xmm0, xmm3

movaps m2+eax, xmm0 ; by the way - whats wrong with this line of code?
; my masm says its wrong use of registers

add eax, 16
cmp eax, 48
jle L1

mov eax, offset m2 ; m2 should be the resulting matrix
ret
MultMatrix endp

DllMain proc hInstDll :DWORD,
dwNotification :DWORD,
lpReserved :DWORD
ret
DllMain endp

end
Posted on 2003-12-22 11:32:32 by mathias l.
It might help if you told exactly which part of the code faults...

Other than that, a couple of suggestions:
*) preserve registers. I assume you're importing the functions as STDCALL in your delphi code, and thus you must preserve esi+edi+ebx+ebp (ie, add "uses esi edi" to your MultMatrix proc).

*) iirc, SSE data has to be 16-byte aligned - otherwise you'll get exceptions (well, unless you fiddle with the SSE control flag, then you'll just suffer from slow speed).
Posted on 2003-12-22 11:59:36 by f0dder
movaps m2+eax, xmm0

Well, this is not an addressing mode. You cannot just add eax to the label in this way, maybe you meant this:

lea ecx, [m2]
movaps [ecx+eax], xmm0
Posted on 2003-12-22 12:13:18 by donkey
a label is a variable name is an offset... as long as you're not dealing with locals (which m2 isn't).
try:


movaps [m2+eax], xmm0
Posted on 2003-12-22 12:17:58 by f0dder
Hi,
I tried assembling with
movaps [m2+eax], xmm0
, which seems to be the right way. I still don't know if it works, but at least it lets me create the DLL without assembler errors.
I've tried to push the four registers esi,edi,ebx,ebp at the beginning of the code and popped them afterwards, but still the exception occurs.

To see exactly the position where the code faults, I tried several versions of the dll.
It's at the point
movaps xmm4, [edi]

just at the beginning of the code. It must be because of [edi], because I tried "movaps xmm4, xmm4" to see whether the problem is the "movaps" instruction itself, but that worked!
So perhaps it's really because of the 16-bit data alignment... For now, I don't know how to tell Delphi to align its data on 16-bit, but I'll consult a Delphi forum about that.

I'll try to translate the text from the Delphi-Exception:

GERMAN:
"EPrivilege wird ausgelöst, wenn eine Anwendung versucht, eine Prozessoranweisung auszuführen, die in der gegenwärtigen Prozessor-Vorrangstufe nicht zulässig ist."

My ENGLISH:
"EPrivilege is raised when an application tries to execute a processor instruction that is not permitted at the current processor privilege level."



Thank you very much for any idea!
Posted on 2003-12-22 13:40:32 by mathias l.
Privilege exception - then it seems like it's the 16 byte (not bit) thing that's the problem - especially since it's with a mov instruction :). There's surely a way to change data alignment in delphi, otherwise you'll have to do it with dynamically allocated memory and manual fixups.
Posted on 2003-12-22 13:54:13 by f0dder
Well, I've consulted some Delphi forums, but the compiler directive that was recommended to me doesn't seem to be available in my version of Delphi.
So I'll try allocating the memory dynamically. What do you mean by manual fixups?
Posted on 2003-12-23 06:17:16 by mathias l.
"manual fixups" as in allocating requiredmem+16 bytes, and adjusting the pointer manually for 16byte alignment (of course saving the original pointer so you can free the memory again). There's bound to be ready-made implementations of this floating around, even for delphi. Note that this is a rather wasteful method for small allocs, there's a lot of strategies for improving on this.
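The idea above can be sketched in C roughly as follows. This is only an illustration under assumptions: alloc_aligned16 and free_aligned16 are made-up names, not functions from any library, and stashing the original pointer directly in front of the aligned block is just one of several ways to keep it around for freeing.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: over-allocate, round the address up to a 16-byte
   boundary, and store the pointer malloc returned just below the aligned
   block so that it can be found again when freeing. */
static void *alloc_aligned16(size_t size)
{
    /* 15 bytes of slack for the rounding, plus room for the saved pointer */
    void *raw = malloc(size + 15 + sizeof(void *));
    if (raw == NULL)
        return NULL;

    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + 15) & ~(uintptr_t)15;
    assert((aligned & 15) == 0);

    /* remember the original pointer right in front of the aligned block */
    memcpy((void *)(aligned - sizeof(void *)), &raw, sizeof raw);
    return (void *)aligned;
}

/* Hypothetical counterpart: fetch the saved pointer and free that one. */
static void free_aligned16(void *p)
{
    if (p != NULL) {
        void *raw;
        memcpy(&raw, (unsigned char *)p - sizeof(void *), sizeof raw);
        free(raw);
    }
}
```

A 4x4 matrix of floats would then be obtained with alloc_aligned16(16 * sizeof(float)) and released with free_aligned16. The per-allocation overhead (up to 15 bytes of padding plus one pointer) is what makes this wasteful for many small blocks.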
Posted on 2003-12-23 06:26:04 by f0dder
mathias,

f0dder is right if you are making many small allocations, but if you are making larger ones, dynamically aligning memory is very simple.

Here is a MASM macro for memory alignment by powers of 2.


memalign MACRO reg, number
add reg, number - 1
and reg, -number
ENDM

"number" in the macro is an immediate number. You should be able to code the 2 instructions in Delphi inline asm to align the memory you allocate.


add eax, 15
and eax, -16

eax is the address of the memory you allocate.
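For anyone who wants to check the arithmetic outside of asm, here is the same rounding written as a small C function. The name align_up is only chosen for this sketch. It relies on the fact that, in two's complement, -number has the same bit pattern as ~(number - 1) when number is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Round addr up to the next multiple of alignment, which must be a power
   of two. Mirrors "add reg, number - 1" followed by "and reg, -number". */
static uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (addr + (alignment - 1)) & ~(alignment - 1);
}
```

An address that is already aligned stays where it is, anything else moves up to the next boundary, so the allocation needs at least number - 1 spare bytes, and the unadjusted address has to be kept for freeing.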

Regards,
Posted on 2003-12-23 07:49:11 by hutch--
Hi, after some Christmas action, I'd like to thank you all for helping me find a solution for my SIMD DLL. The MultMatrix function works now, and if anyone's interested, here's the asm DLL and the Delphi code that uses it:
.586

.xmm
.model flat, stdcall
option casemap: none

.code

MultMatrix proc ths: DWORD, m1: DWORD, m2:DWORD ; These are the addresses of three 4x4 matrices,
; each of which is 16*4 bytes long
push esi
push edi
push eax
push edx

mov edi, m1

movaps xmm4, [edi]
movaps xmm5, [edi+16]
movaps xmm6, [edi+32]
movaps xmm7, [edi+48]

mov esi, ths
mov eax, 0

L1:
movaps xmm0, [esi+eax]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0

shufps xmm0, xmm2, 000h
shufps xmm1, xmm2, 055h
shufps xmm2, xmm2, 0AAh
shufps xmm3, xmm3, 0FFh

mulps xmm0, [edi]
mulps xmm1, [edi+16]
mulps xmm2, [edi+32]
mulps xmm3, [edi+48]

addps xmm0, xmm1
addps xmm0, xmm2
addps xmm0, xmm3

mov edx, m2
add edx, eax
movaps [edx], xmm0

add eax, 16
cmp eax, 48
jle L1

pop edx
pop eax
pop edi
pop esi

ret
MultMatrix endp

DllMain proc hInstDll :DWORD,
dwNotification :DWORD,
lpReserved :DWORD
ret
DllMain endp

end


Delphi:

unit Unit1;


interface

uses
Windows, Messages, SysUtils, Classes, Graphics, Controls, Forms, Dialogs,
StdCtrls;

type
TMyMatrix = array[0..15] of Single;
PMyMatrix = ^TMyMatrix;

TForm1 = class(TForm)
Button1: TButton;
procedure Button1Click(Sender: TObject);
private
{ Private-Deklarationen }

public
{ Public-Deklarationen }
end;

var
Form1: TForm1;

procedure MultMatrix(x, y, z: Pointer); stdcall; external 'test.dll';

implementation

{$R *.DFM}

procedure TForm1.Button1Click(Sender: TObject);
var
pa, pb, pc: PMyMatrix;
p1, p2, p3: Pointer;
i: Integer;
begin


GetMem(p1, SizeOf(TMyMatrix)+15);
pa:= PMyMatrix((Integer(p1)+$0F) and $FFFFFFF0);
ZeroMemory(pa, SizeOf(TMyMatrix));

GetMem(p2, SizeOf(TMyMatrix)+15);
pb:= PMyMatrix((Integer(p2)+$0F) and $FFFFFFF0);
ZeroMemory(pb, SizeOf(TMyMatrix));

GetMem(p3, SizeOf(TMyMatrix)+15);
pc:= PMyMatrix((Integer(p3)+$0F) and $FFFFFFF0);
ZeroMemory(pc, SizeOf(TMyMatrix));

for i:=0 to 15 do pc^[i]:=0.0; // pc is an empty matrix, pa, pb are
// arbitrary 4x4 Matrices

pa^[00]:= 1.0; pa^[01]:= 2.0; pa^[02]:= 3.0; pa^[03]:= 1.0;
pa^[04]:= 2.0; pa^[05]:= 3.0; pa^[06]:= 1.0; pa^[07]:= 2.0;
pa^[08]:= 3.0; pa^[09]:= 1.0; pa^[10]:= 2.0; pa^[11]:= 3.0;
pa^[12]:= 1.0; pa^[13]:= 2.0; pa^[14]:= 3.0; pa^[15]:= 1.0;

pb^[00]:= 3.0; pb^[01]:= 2.0; pb^[02]:= 1.0; pb^[03]:= 3.0;
pb^[04]:= 2.0; pb^[05]:= 1.0; pb^[06]:= 3.0; pb^[07]:= 2.0;
pb^[08]:= 1.0; pb^[09]:= 3.0; pb^[10]:= 2.0; pb^[11]:= 1.0;
pb^[12]:= 3.0; pb^[13]:= 2.0; pb^[14]:= 1.0; pb^[15]:= 3.0;




MultMatrix(pa, pb, pc);

//pc is the Result

FreeMem(p1);
FreeMem(p2);
FreeMem(p3);
end;

end.
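
To double-check the result in pc, a plain scalar version of the same calculation can help. The following C function is only a reference sketch with a made-up name (mult_matrix_ref); it assumes the matrices are stored as 16 floats row by row, like TMyMatrix above.

```c
#include <assert.h>

/* Scalar model of the SSE loop: for each row i of ths, every element
   ths[i][k] is broadcast (the shufps step), multiplied with row k of m1
   (mulps), and the four products are added up (addps) into row i of m2. */
static void mult_matrix_ref(const float *ths, const float *m1, float *m2)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += ths[4 * i + k] * m1[4 * k + j];
            m2[4 * i + j] = sum;
        }
    }
}
```

With the pa and pb values from the Delphi code, this gives the rows (13, 15, 14, 13), (19, 14, 15, 19), (22, 19, 13, 22) and (13, 15, 14, 13), which is what pc should contain after the call to the DLL.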


Merry Christmas!
Posted on 2003-12-26 05:54:02 by mathias l.
Have you tried using MMXasm?
http://www.yks.ne.jp/~hori/MMXasm-e.html
Posted on 2003-12-26 06:29:03 by Delight
mathias, be sure to return TRUE in your dllmain when it's called with DLL_PROCESS_ATTACH...
Posted on 2003-12-27 01:16:13 by f0dder
Hi Mathias !
You have done an excellent job. But, sorry, I have some ideas.
Why don't you use BASM in Delphi? Are you using Delphi 5 or above? BASM in Delphi supports all instructions found in the Intel Pentium III, Intel MMX extensions, Streaming SIMD Extensions (SSE), and the AMD Athlon (including 3D Now!). Here is an excerpt from the Delphi Help:
"The built-in assembler supports all of the Intel-documented opcodes for general application use. Note that operating system privileged instructions may not be supported. Specifically, the following families of instructions are supported:

Pentium family
Pentium Pro and Pentium II
Pentium III
Pentium 4

Note: Pentium 4 instructions are only supported on Windows.

In addition, the built-in assembler supports the following instruction sets

AMD 3DNow! (from the AMD K6 onwards)
AMD Enhanced 3DNow! (from the AMD Athlon onwards)"

I modified your code to use BASM in Delphi. It runs OK.
program Test;

{$APPTYPE CONSOLE}

uses
Windows, SysUtils;

type
PMyMatrix = ^TMyMatrix;
TMyMatrix = array[0..15] of Single;

procedure MultMatrix(ths, m1, m2: Pointer);
asm
// According to fastcall convention in Delphi
// eax = ths, edx = m1, ecx = m2
// We do not need to preserve eax, ecx and edx, only esi, edi, ebx

push esi
push edi

mov edi, m1

movaps xmm4, [edi]
movaps xmm5, [edi+16]
movaps xmm6, [edi+32]
movaps xmm7, [edi+48]

mov esi, ths
mov eax, 0

@@L1:
movaps xmm0, [esi+eax]
movaps xmm1, xmm0
movaps xmm2, xmm0
movaps xmm3, xmm0

shufps xmm0, xmm2, 000h
shufps xmm1, xmm2, 055h
shufps xmm2, xmm2, 0AAh
shufps xmm3, xmm3, 0FFh

mulps xmm0, [edi]
mulps xmm1, [edi+16]
mulps xmm2, [edi+32]
mulps xmm3, [edi+48]

addps xmm0, xmm1
addps xmm0, xmm2
addps xmm0, xmm3

mov edx, m2
add edx, eax
movaps [edx], xmm0

add eax, 16
cmp eax, 48
jle @@L1

pop edi
pop esi
end;

procedure DoMult();
var
pa, pb, pc: PMyMatrix;
p1, p2, p3: Pointer;
i: Integer;
begin
GetMem(p1, SizeOf(TMyMatrix) + 15);
pa := PMyMatrix((Integer(p1) + $0F) and $FFFFFFF0);
ZeroMemory(pa, SizeOf(TMyMatrix));

GetMem(p2, SizeOf(TMyMatrix) + 15);
pb := PMyMatrix((Integer(p2) + $0F) and $FFFFFFF0);
ZeroMemory(pb, SizeOf(TMyMatrix));

GetMem(p3, SizeOf(TMyMatrix) + 15);
pc := PMyMatrix((Integer(p3) + $0F) and $FFFFFFF0);
ZeroMemory(pc, SizeOf(TMyMatrix));

for i := 0 to 15 do pc^[i] := 0.0; // pc is an empty matrix, pa, pb are
// arbitrary 4x4 Matrices

pa^[00] := 1.0; pa^[01] := 2.0; pa^[02] := 3.0; pa^[03] := 1.0;
pa^[04] := 2.0; pa^[05] := 3.0; pa^[06] := 1.0; pa^[07] := 2.0;
pa^[08] := 3.0; pa^[09] := 1.0; pa^[10] := 2.0; pa^[11] := 3.0;
pa^[12] := 1.0; pa^[13] := 2.0; pa^[14] := 3.0; pa^[15] := 1.0;

pb^[00] := 3.0; pb^[01] := 2.0; pb^[02] := 1.0; pb^[03] := 3.0;
pb^[04] := 2.0; pb^[05] := 1.0; pb^[06] := 3.0; pb^[07] := 2.0;
pb^[08] := 1.0; pb^[09] := 3.0; pb^[10] := 2.0; pb^[11] := 1.0;
pb^[12] := 3.0; pb^[13] := 2.0; pb^[14] := 1.0; pb^[15] := 3.0;

MultMatrix(pa, pb, pc);

//pc is the Result
FreeMem(p1);
FreeMem(p2);
FreeMem(p3);
end;

begin
DoMult();
ReadLn;
end.

Best regards !
TQN
Posted on 2003-12-27 02:34:39 by TQN



Yep, and if he's not running Delphi 5 or above he can use MMXasm. It's an extension that converts asm code to db XX,XX,XX statements.
Posted on 2003-12-27 03:29:48 by Delight
Hi, I've tried to do it in my Delphi 5 Enterprise version with inline asm, but "movaps" and the other SSE instructions caused errors. Are you sure that Delphi 5 supports them?
Otherwise, I'll try the MMXasm library.

Thanks a lot for helping!
Posted on 2003-12-28 12:50:59 by mathias l.
Hi mathias l.
I am sorry, I made a mistake. You need Delphi 6 or 7 to build your code.
Best regards
TQN
Posted on 2003-12-28 21:24:48 by TQN
I apologize for digging up this old thread

I'm really interested in that :

Originally posted by f0dder

*) iirc, SSE data has to be 16-byte aligned - otherwise you'll get exceptions (well, unless you fiddle with the SSE control flag, then you'll just suffer from slow speed).


f0dder, could you explain a bit how to fiddle with that flag in order to have the cpu execute SSE instructions on non-16-bytes-aligned memory without complaining ?

thanks a lot in advance
Posted on 2004-01-16 05:12:37 by Chrishka
Hm, I can't seem to find information on disabling exceptions on unaligned access, so perhaps I got some things confused when reading the references last time - at least there's no mention of it on LDMXCSR/STMXCSR - bleh. It would be slow anyway, so it's probably a good thing you can't disable the exceptions :)

If you really need unaligned access, use movups, movupd.

The MXCSR can be used for other SSE control, though... like rounding modes.
Posted on 2004-01-16 06:38:24 by f0dder
Argh, too bad, it would've helped me a lot. Of course there's movups and so on, but my problem is with 'psadbw xmm,mem128'.
I think I'll have to do it in two steps with the MMX version.
Posted on 2004-01-16 06:51:05 by Chrishka
Why not make sure your data is aligned?
Posted on 2004-01-16 06:53:01 by f0dder