Hi,

I'm looking for some help putting together a DLL that can be used from Visual Basic 6. I've googled and searched, but can't find enough pieces of the puzzle to get me on the right track.

I'm not sure if I need just an ASM compiler or a C++ compiler with inline ASM support to do this. I also believe that VB handles DLL calling conventions (CDECL vs STDCALL) differently, and that needs consideration. I'm not sure of the best compiler to use (considering speed of compiled code).

I want to write a few alphablend functions based on the caller's needs, and would like to know what the fastest methods are. I've googled and I hear MMX registers are good, and that a 50%-only blend (via bitshifting) is even faster, but with a constant blend rate.

Passing the parameters to the DLL concerns me (push/pop vs speed), so I presume a way around this would be to pass a pointer to a structure with the parameters. I presume VB6's ByRef and SafeArray don't void this optimization.

I'm looking for speed, but want to develop within VB6. So I'm guessing a DLL built from ASM is the closest I can get to that goal. I'm a complete novice outside of VB6, so any help with some direction or templates to get me going on this would be amazing.
Posted on 2011-12-08 18:31:31 by timfrombriz
Hi
The most experienced coder I know who can answer your questions is Obivan.
In general, it is not hard to interface a DLL with VB6.
A line like
Public Declare Sub MyProc Lib "MyLib" (ByVal DstPtr As Long, etc...)

does the trick to import the DLL procedure.
The rest is up to you.

Regards, Biterider


Posted on 2011-12-09 06:54:23 by Biterider
"ASM compiler"? No, it's called an assembler. Assembling is not quite the same as compiling. Assembly language is one of the few languages you don't compile.

Anyway...
Speedwise, it doesn't really matter whether you use ASM or C++ with inline ASM... That is, if you are writing the routine in 100% ASM anyway. With a __declspec(naked) function, the C++ compiler will not add any code to your function whatsoever, so if you write the function entirely in assembly, the result is the same as with an assembler.
You might want to choose an assembler anyway, since they will generally have a richer, more powerful assembly syntax, allowing you to use macros and such.

MMX is good, but SSE2 is even better. MMX is pretty outdated these days (deprecated in Windows x64). SSE2 allows you to use MMX operations on SSE registers, which are twice as wide, and not shared with the FPU (so no FPU stack cleanup overhead through emms afterwards).

As for calling overhead: the fastest way to call a function is not to call it.
What this means is: you should not try to make a function to blend a single pixel. No matter how fast you make the function itself, the calling overhead for every pixel will still have a significant impact.
You should make a single function that performs alphablending on an entire image at a time (or at least, part of an image, defined by a RECT or such, which you would probably want to pass by-ref, yes). That way, all the memory access, looping etc can be optimized along with the blend routine itself, in a single piece of code.
Since the time spent on blending images of any reasonable size is much larger than the calling overhead, this means you don't even have to worry about that, really.
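To make the "one call per image" idea concrete, here is a rough C sketch of a routine that blends a whole rectangle in a single call. The name blend_rect, its parameter list and the rounding formula are my own assumptions for illustration, not something taken from a library; it works per byte, so for 32-bit pixels row_bytes would be width * 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: blend src over dst across a whole rectangle.
   row_bytes  - bytes to process per row (pixels * bytes per pixel)
   rows       - number of rows in the blit area
   *_stride   - bytes from the start of one row to the start of the next
   alpha      - 0 keeps dst unchanged, 255 takes src unchanged */
void blend_rect(uint8_t *dst, ptrdiff_t dst_stride,
                const uint8_t *src, ptrdiff_t src_stride,
                size_t row_bytes, size_t rows, uint8_t alpha)
{
    assert(dst != NULL && src != NULL);
    unsigned inv = 255u - alpha;

    for (size_t y = 0; y < rows; y++) {
        for (size_t x = 0; x < row_bytes; x++) {
            /* Weighted sum fits easily in 32 bits; +127 rounds to nearest. */
            unsigned v = src[x] * alpha + dst[x] * inv;
            dst[x] = (uint8_t)((v + 127u) / 255u);
        }
        dst += dst_stride;   /* step to the next row of each surface */
        src += src_stride;
    }
}
```

The caller pays the call overhead once per rectangle, and the inner loop is the part you would later replace with SSE2 code.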
I would personally choose STDCALL, since that is the same convention as used by the Windows API, and is therefore compatible with most languages. It is slightly more efficient than cdecl, since it doesn't require stack cleanup on the caller side (stack is cleaned up automatically with the ret).
Fastcall would theoretically be slightly faster, but it is less compatible with other languages (not sure if VB would support it). And in this case it is not worth the trouble, as the bulk of the time is spent in the alphablending routine (the good old 90/10 rule).

Also, yes 50% alphablending is just taking the average of 2 pixels.
You could use bitshifting, but it is not entirely accurate. You have to shift first, to make room for the addition.
So you'd do something like this:
result = (a >> 1) + (b >> 1);
The problem is easily demonstrated with the value 255:
result = (255 >> 1) + (255 >> 1) = 127 + 127 = 254.
You are off-by-one. Repeated alphablending will make your images darker and darker (some videocards actually suffered from an off-by-one error in alphablending btw).
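To put a number on how often the shift-first version is off, here is a small C sketch (the function names are made up for illustration). It compares shifting before the add against adding in a wider type and shifting afterwards, over all 256 x 256 input pairs.

```c
#include <stdint.h>

/* Shift first: the sum always fits in 8 bits, but each operand loses its low bit. */
uint8_t avg_shift_first(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1));
}

/* Add first in a wider type, then shift: nothing is lost before the add. */
uint8_t avg_add_first(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* Count the input pairs for which the two versions disagree. */
int count_mismatches(void)
{
    int n = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (avg_shift_first((uint8_t)a, (uint8_t)b) !=
                avg_add_first((uint8_t)a, (uint8_t)b))
                n++;
    return n;
}
```

The two disagree exactly when both inputs are odd, because the two dropped low bits would have added up to a carry. That is 128 x 128 = 16384 pairs, a quarter of all inputs.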

So what you are looking for is more like this:
result = (a + b) >> 1;
But that requires 9 bits per pixel, not 8.
But, this is basically just the average of 2 values.
And luckily there already is an instruction for that in SSE: pavgb:
http://www.rz.uni-karlsruhe.de/rz/docs/VTune/reference/vc230.htm
It actually does ((a + b) + 1) >> 1, which is even more correct.
Namely, if you were to take the average of 0 and 1, we'd say that's 0.5, which we'd generally round up to 1.
A shift will always round down, so you'd get 0. By adding the 1 before the shift, it will round towards the nearest number.
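For reference, this is roughly what the pavgb approach looks like from C with SSE2 intrinsics, as a sketch rather than a tuned routine. _mm_avg_epu8 is the intrinsic for pavgb on 128-bit registers; the wrapper name avg16 is my own, and it assumes an x86 target with SSE2.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: 50% blend of 16 bytes at once.
   Each output byte is (a + b + 1) >> 1, computed without overflow. */
void avg16(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    assert(dst != NULL && a != NULL && b != NULL);
    /* Unaligned loads/stores, so the buffers need no special alignment. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_avg_epu8(va, vb));
}
```

With 32-bit pixels, one call handles four pixels, and 255 blended with 255 stays 255.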

I hope this helps somewhat.
Posted on 2011-12-09 07:13:16 by Scali
Hi BiteRider/Scali,

Thanks for the feedback.

As for the best assembler to use: I looked through and found MASM, FASM, TASM, YASM and HLA. Out of these I chose FASM: open source, assembles itself, said to be focused on speed, and seems to be the most popular. Any feedback on this selection?

I think sticking with an assembler rather than C++ with inline ASM would be best for me. I'd need to pick up both C++ and ASM, and I'm thinking bloat from the C++ compiler will get thrown into the release DLL. I did find some FASM code to build a DLL callable from VB6 and it compiled to 5 KB, which made me smile.

I read that for VB DLLs, 32-bit ASM needs:

.MODEL FLAT, STDCALL

My game engine runs on Windows (VB6 of course) and uses DirectDraw. I've found that, speedwise, it's faster to lock the surfaces, pixel plot completely, unlock and flip the backbuffer to the screen than to use DirectDraw's blitting functions. I'm not sure why blitting with native DirectDraw calls would be slower; perhaps my approach to pixel plotting is superior to hardware blitting (lol). I created an instruction list per sprite for skipping/copying blocks of pixels between the two surfaces, rather than checking for transparency on a per-pixel basis.

It's interesting that you mention SSE2 as superior to MMX. In my pixel plotting code, I use the RtlMoveMemory API to copy between surfaces, and I found this article on how MMX ASM improved speed over RtlMoveMemory:

http://www.persistentrealities.com/vbfibre/index.php?category=4&item=1&t=asm

I presume that SSE2 could further the benefit here as well.



Here is a DLL framework for FASM I found;


; DLL creation example

format PE GUI 4.0 DLL
entry DllEntryPoint

include 'win32a.inc'

section '.code' code readable executable

; Called by Windows on load/unload; returning TRUE lets the load succeed.
proc DllEntryPoint hinstDLL,fdwReason,lpvReserved
        mov     eax,TRUE
        ret
endp

; VOID ShowErrorMessage(HWND hWnd,DWORD dwError);

proc ShowErrorMessage hWnd,dwError
  local lpBuffer:DWORD
        ; FormatMessage allocates the text buffer and stores its address in lpBuffer
        lea     eax,[lpBuffer]
        invoke  FormatMessage,FORMAT_MESSAGE_ALLOCATE_BUFFER+FORMAT_MESSAGE_FROM_SYSTEM,0,[dwError],LANG_NEUTRAL,eax,0,0
        invoke  MessageBox,[hWnd],[lpBuffer],NULL,MB_ICONERROR+MB_OK
        ; release the buffer that FormatMessage allocated
        invoke  LocalFree,[lpBuffer]
        ret
endp

; VOID ShowLastError(HWND hWnd);

proc ShowLastError hWnd
        invoke  GetLastError
        stdcall ShowErrorMessage,[hWnd],eax
        ret
endp

section '.idata' import data readable writeable

  library kernel,'KERNEL32.DLL',\
          user,'USER32.DLL'

  import kernel,\
        GetLastError,'GetLastError',\
        SetLastError,'SetLastError',\
        FormatMessage,'FormatMessageA',\
        LocalFree,'LocalFree'

  import user,\
        MessageBox,'MessageBoxA'

section '.edata' export data readable

  ; names after the DLL name are (label, exported name) pairs
  export 'ERRORMSG.DLL',\
        ShowErrorMessage,'ShowErrorMessage',\
        ShowLastError,'ShowLastError'

section '.reloc' fixups data discardable
 


I absolutely agree with you about not calling on a per-pixel basis, and rather working on a RECT section between a source and dest. To minimize the number of parameters to pass, would it be beneficial to pass a pointer to a struct holding the parameters, or does this cause a slowdown (i.e. CPU caching, having to fetch stuff from RAM; I'm making this stuff up, I've just been reading heaps and these are some things I've heard but may have interpreted wrongly)?


I.e. the structure I'd fill for the function would be something like this:



Source          ; pointer to the memory location of (y*SourceScanline+x) for the source surface
Dest            ; pointer to the memory location of (y*DestScanline+x) for the dest surface
Width           ; the width of the source/dest blit area
Height          ; the height of the source/dest blit area
SourceScanline  ; how much to increment Source to step one row down the source surface
DestScanline    ; how much to increment Dest to step one row down the dest surface
Alpha           ; a Byte I guess? With 256 different shade possibilities


I think the hardest thing for me is going to be getting my head around converting code to fit my needs. My head's overloaded at the moment with the information I've been trying to absorb.


As a starting point, could someone maybe modify the FASM code template above to implement an RtlMoveMemory replacement using SSE2 or MMX for speed, taking source, dest and len as parameters? It'd be nice to have a real-world function to see what potential speed benefit I might get through ASM, and it seems (in my mind) a simple enough algorithm for me to look at and try to make sense of.
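For a feel of what such a routine does, here is a minimal C sketch using SSE2 intrinsics instead of FASM. The name copy_sse2 is hypothetical, and unlike RtlMoveMemory it assumes the two buffers do not overlap. Whether it actually beats RtlMoveMemory would have to be measured.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: copy len bytes from src to dst, 16 bytes at a time.
   Assumes non-overlapping buffers; no alignment is required. */
void copy_sse2(void *dst, const void *src, size_t len)
{
    assert(len == 0 || (dst != NULL && src != NULL));
    uint8_t *d = (uint8_t *)dst;
    const uint8_t *s = (const uint8_t *)src;

    /* Main loop: one unaligned 128-bit load and store per iteration. */
    while (len >= 16) {
        _mm_storeu_si128((__m128i *)d, _mm_loadu_si128((const __m128i *)s));
        d += 16;
        s += 16;
        len -= 16;
    }
    /* Tail: copy the remaining 0..15 bytes one at a time. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
}
```

The same loop shape (wide main loop plus a byte tail) carries over directly to an assembly version.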


Thanks for your help and feedback this far.  It really is appreciated.
Posted on 2011-12-09 19:43:01 by timfrombriz

I'm thinking bloat from the C++ compiler will get thrown into the release DLL.


As I said: it won't.

I've found that, speedwise, it's faster to lock the surfaces, pixel plot completely, unlock and flip the backbuffer to the screen than to use DirectDraw's blitting functions. I'm not sure why blitting with native DirectDraw calls would be slower; perhaps my approach to pixel plotting is superior to hardware blitting (lol).


Hardware blits are faster than CPU, but they only work if you blit from videomemory to videomemory.
So you'd need to upload all your graphics to surfaces in videomemory first, and blit from there.
Then again, DirectDraw has been deprecated for years now, so perhaps the drivers just aren't very optimized anymore. These days you want to use either Direct3D or the new Direct2D (Windows Vista or higher). There is no specific blitting/2D hardware anymore; videocards perform all operations with textured polygons. It's not that difficult to set up a framework in Direct3D for doing 2D rendering.
Again the idea is the same though: create textures in videomemory first, then use the GPU to 'blit' them to screen (rendering 2 triangles to create a rectangular area with a texture).


I absolutely agree with you about not calling on a per-pixel basis, and rather working on a RECT section between a source and dest. To minimize the number of parameters to pass, would it be beneficial to pass a pointer to a struct holding the parameters, or does this cause a slowdown (i.e. CPU caching, having to fetch stuff from RAM; I'm making this stuff up, I've just been reading heaps and these are some things I've heard but may have interpreted wrongly)?


Well, assuming you just put the parameters IN the struct, they are also cached.
So in general, things should be fine. There's just an extra indirection in that you have to read the pointer from the stackframe first, before you can access its members (could cost you an extra register).
But aside from that, regular parameters on the stack are just a struct as well, technically.
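As a sketch of the struct approach in C: the field names below mirror the list from the question, but the exact layout, the types and the function name blend_params are assumptions for illustration. On the VB6 side you would presumably declare a matching Type and pass it ByRef.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameter block, filled in by the caller before the call. */
typedef struct BlitParams {
    const uint8_t *source;          /* first byte of the source area */
    uint8_t       *dest;            /* first byte of the dest area */
    int32_t        width;           /* bytes to process per row */
    int32_t        height;          /* number of rows */
    int32_t        source_scanline; /* bytes between source rows */
    int32_t        dest_scanline;   /* bytes between dest rows */
    uint8_t        alpha;           /* 0 keeps dest, 255 takes source */
} BlitParams;

void blend_params(const BlitParams *p)
{
    assert(p != NULL);
    /* Read every member into a local once, so the extra indirection
       through p is paid per call and not per pixel. */
    const uint8_t *src = p->source;
    uint8_t *dst = p->dest;
    int32_t width = p->width, height = p->height;
    int32_t src_step = p->source_scanline, dst_step = p->dest_scanline;
    unsigned a = p->alpha, inv = 255u - a;

    for (int32_t y = 0; y < height; y++) {
        for (int32_t x = 0; x < width; x++)
            dst[x] = (uint8_t)((src[x] * a + dst[x] * inv + 127u) / 255u);
        src += src_step;
        dst += dst_step;
    }
}
```

A compiler will usually keep those locals in registers, which is about what you would do by hand in assembly.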
Posted on 2011-12-10 04:35:12 by Scali