Keep in mind that VC6 is approaching 10 years of age (it was released sometime during 1998, according to Wikipedia). So no wonder improvements have been made since then :). VC2005 can even do whole-program analysis, with code generation at link time - that's pretty neat.

You can still get assembly output from VC2005 that's easy enough to read. Drop to the command line and run "cl /O1 /FAs module.cpp" - or drop the /O1 to turn off optimizations entirely. (/O1 = optimize for size, /O2 = optimize for speed, /Ox = max optimizations).


So there might still be hope of writing an entire asm function that improves on the compiler.

Very likely :)

Even though compilers (MSVC, Intel, and GCC all do, iirc) have "intrinsics" for MMX/SSE, they're pretty far from generating optimal code - so if you get your head around those instruction sets, well... :)


However, it's somewhat of a moot point, as the function I've been testing on isn't actually going to be used; it was just a simple test.

It's not a moot point as long as there's a learning experience involved!


I'm not entirely sure it can. There are some obvious changes that could be made, but as such I can't see them making a huge difference. But who knows, perhaps I'll learn something new.

Rewriting the inner loop to only do full 32-bit reads: read 3 dwords (4 pixels at 24bpp), do the necessary transformations, and write out the three dwords. That should give a performance increase. Of course, you'll need a little extra code outside the inner loop to take care of bitmaps that aren't a multiple of 4 pixels wide.
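
A minimal C++ sketch of that idea, not the code from this thread: it swaps the first and third byte of every packed 24bpp pixel (an R/B swap) by working on three dwords, i.e. four pixels, at a time. The function name SwapRedBlue24 is made up for illustration, and the bit masks assume a little-endian CPU such as x86. Leftover pixels are handled one byte at a time.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: swap byte 0 and byte 2 of each packed 24bpp pixel, in place.
// Assumes a little-endian CPU (x86). pixelCount does not have to be a multiple of 4.
void SwapRedBlue24(std::uint8_t* row, std::size_t pixelCount)
{
    std::uint8_t* p = row;
    std::size_t remaining = pixelCount;

    // Main loop: 12 bytes = 3 dwords = 4 pixels per iteration.
    for (; remaining >= 4; remaining -= 4, p += 12)
    {
        std::uint32_t d0, d1, d2;
        std::memcpy(&d0, p,     4);   // memcpy avoids alignment and aliasing problems
        std::memcpy(&d1, p + 4, 4);
        std::memcpy(&d2, p + 8, 4);

        // Byte layout in memory (X/Z are the channels being swapped):
        // d0 = X0 Y0 Z0 X1   d1 = Y1 Z1 X2 Y2   d2 = Z2 X3 Y3 Z3
        const std::uint32_t o0 = ((d0 >> 16) & 0x000000FFu) | (d0 & 0x0000FF00u)
                               | ((d0 & 0x000000FFu) << 16) | ((d1 & 0x0000FF00u) << 16);
        const std::uint32_t o1 = (d1 & 0xFF0000FFu) | ((d0 >> 16) & 0x0000FF00u)
                               | ((d2 & 0x000000FFu) << 16);
        const std::uint32_t o2 = ((d1 >> 16) & 0x000000FFu) | ((d2 >> 16) & 0x0000FF00u)
                               | (d2 & 0x00FF0000u) | ((d2 & 0x0000FF00u) << 16);

        std::memcpy(p,     &o0, 4);
        std::memcpy(p + 4, &o1, 4);
        std::memcpy(p + 8, &o2, 4);
    }

    // Tail: widths that aren't a multiple of 4 pixels.
    for (; remaining > 0; --remaining, p += 3)
    {
        const std::uint8_t tmp = p[0];
        p[0] = p[2];
        p[2] = tmp;
    }
}
```

Whether this actually beats the byte-wise loop depends on the compiler and CPU, so it is worth timing both.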


So for the sake of completeness I'll post the function, but we are rapidly moving away from talking about asm.

Yes and no - when optimizing something, you start by doing higher-level/algorithmic optimizations (sometimes data organization, when you have control of it, can matter a lot too), and see how far that gets you. Then you can do an assembly implementation, use MMX/SSE et cetera.

Sometimes assembly allows you to express algorithms in ways that would be hard to do in a HLL - that can also lead to some pretty interesting optimizations.
Posted on 2007-01-17 09:28:58 by f0dder
Sorry for killing your fun, guys, but DirectX can flip images both horizontally and vertically in hardware (using virtually no time), so what's the point? :P
Posted on 2007-01-17 09:58:18 by ti_mo_n

Sorry for killing your fun, guys, but DirectX can flip images both horizontally and vertically in hardware (using virtually no time), so what's the point? :P

The learning experience - it can be beneficial to re-invent the wheel every now and then. Also:

There's probably some hardware-accelerated way to do your flipping with DirectX, which you should look into if you really want maximal speed, but let's focus on optimizing the algorithm instead - that's a bit more fun.

Posted on 2007-01-17 10:05:08 by f0dder

Sorry for killing your fun, guys, but DirectX can flip images both horizontally and vertically in hardware (using virtually no time), so what's the point? :P


... and then you missed one of the biggest points of using ASM... to cut down on unnecessary bloat ;)
Posted on 2007-01-17 11:54:13 by SpooK

It's not a moot point as long as there's a learning experience involved!

Oh, I completely agree. That's why I don't believe this has been a waste of my time, and there is still plenty more to learn, not only about asm but, as you and others have said, about writing C++ code in a way that accounts for the underlying architecture and how the CPU and memory work. I'm still reading the AoA, Agner's resources and several other PDFs/websites I've come across over the last few days.

It's just a shame it's not going to be as straightforward as it looked like it might be at the beginning. I actually had a look at the output for my gamma ramp algorithm. Yikes, moving into floating point calculations in asm is a whole new scary place to be ;) I wonder if it's worth doing it as fixed point math  :lol:
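
One common way to keep floating point out of the per-pixel loop entirely (a different technique from fixed point, named here plainly as a lookup table) is to compute the ramp once for all 256 possible byte values and then just index into it. The sketch below is only an illustration under assumptions: the function names are invented, the curve is a plain power-law gamma with gamma > 0, and the real gamma ramp algorithm from the project is not shown in this thread.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: build a 256-entry table once. All floating-point work happens here.
// Assumes gamma > 0 and a simple power-law curve.
void BuildGammaTable(double gamma, std::uint8_t table[256])
{
    for (int v = 0; v < 256; ++v)
    {
        const double normalized = v / 255.0;
        const double corrected  = std::pow(normalized, 1.0 / gamma);
        table[v] = static_cast<std::uint8_t>(corrected * 255.0 + 0.5); // round to nearest
    }
}

// Hypothetical helper: per-byte work is a single table lookup, integer only.
// Note: this touches every byte, so for 32bpp data it would also remap the alpha byte.
void ApplyTable(std::uint8_t* bytes, std::size_t count, const std::uint8_t table[256])
{
    for (std::size_t i = 0; i < count; ++i)
        bytes[i] = table[bytes[i]];
}
```

The table only needs rebuilding when the gamma value changes, so the cost of std::pow is paid 256 times rather than once per channel per pixel.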


Rewriting the inner loop to only do full 32-bit reads:


Yep, this looks like a good opportunity to gain some performance, definitely something I'll look into. Maybe I'll do it in C++ first and see what happens. However, I'm not sure how well I'll get on trying to insert asm into VC2005, as it is far less clear in its use of registers. Heck, I can't even follow most of the asm it produces now. I'm going to have to read up on how values are passed to functions and the like to get to grips with it.

One big disappointment with VC2005, though, is that it's literally doubling the size of my DLLs compared to VC6, and I can't work out the reason. I assumed the initial size increase I saw was due to some essential parts of the DLL just being bigger, as someone suggested the 'vc2005 runtime'. That I could live with, a small one-time hit per DLL, but it doesn't appear to be the case. I transferred over another project and that doubled in size from 64k to 116k, so it looks like everything is being doubled in size, which for downloadable components isn't good. I will have to investigate this further.


Sorry for killing your fun, guys, but DirectX can flip images both horizontally and vertically in hardware (using virtually no time), so what's the point?


Well, where's the fun, challenge and achievement in doing that? If it wasn't for the fact that I started off using a 10-year-old compiler, I'd be an asm guru getting four times the performance just by converting a few lines of C++ 8) That's a whole lot of fun and achievement, somewhat lessened now by the fun-spoiling VC2005 compiler.  :sad:

It also doesn't account for several of the requirements I needed to meet in the first place, such as R & B swapping and applying different gamma ramp algorithms, although I wouldn't be surprised if that could also be done in DirectX. Not sure I really want to learn DirectX just for that, although ironically at some point in the future the resultant image will be dumped into an OpenGL texture, via hooking and other voodoo, since Director has never had an SDK for its 3D engine released.


thanks
Posted on 2007-01-17 13:57:17 by noisecrime

moving into floating point calculations in asm is a whole new scary place to be ;)

Yeah - I hate it. Stack-based and all, eek. You could use SSE for floating-point stuff, which is register based... but then you set a minimum CPU, and there are other things to consider as well.


However, I'm not sure how well I'll get on trying to insert asm into VC2005, as it is far less clear in its use of registers.

EAX, ECX and EDX are freely trashable; EBX, ESI, EDI and EBP need to be preserved (as in "if you change them, push/pop", not "always push and pop them" (duh)). Generally I stay away from inline asm though - more bother than it's worth, can conflict with the compiler's optimizations, etc.


I transferred over another project and that doubled in size from 64k to 116k, so it looks like everything is being doubled in size, which for downloadable components isn't good. I will have to investigate this further.

Nah, it's not a doubling in size; it's the additional runtime overhead that is larger - I think it's around 50kb or so, which seems consistent with your results. If you're going to use a lot of DLLs in a project, you can move to the dynamic-link version of the runtime.

A size increase of this order should hardly be seen as a problem, though - even back in the 486 times, 50kb wasn't much (except if you still did realmode programming). It's data that's the killer these days.

Generated code itself might be a bit larger, though. Usually optimizing for speed means more code (loop unrolling, anyone?).


It also doesn't account for several of the requirements I needed to meet in the first place, such as R & B swapping and applying different gamma ramp algorithms, although I wouldn't be surprised if that could also be done in DirectX.

It's been a fair number of years since I've used DirectX, but I wouldn't be surprised if the conversion is as simple as setting the right bitmap formats and doing a blit - which is likely to be hardware accelerated. Iirc there's also gamma stuff in the more recent versions, although if you need it to be precise it might be a good idea to do it by hand (both ATi and NVidia have been known to cut some corners precision-wise to achieve better benchmark results, at least in the past).


Not sure I really want to learn DirectX just for that, although ironically at some point in the future the resultant image will be dumped into an opengl texture

OpenGL should be able to do all this as well, also hardware accelerated :)
Posted on 2007-01-17 15:23:57 by f0dder
The learning experience - it can be beneficial to re-invent the wheel every now and then. Also:

... and then you missed one of the biggest points of using ASM... to cut down on unnecessary bloat ;)

I know, I know :)

It also doesn't account for several of the requirements I needed to meet in the first place, such as R & B swapping and applying different gamma ramp algorithms, although I wouldn't be surprised if that could also be done in DirectX. Not sure I really want to learn DirectX just for that, although ironically at some point in the future the resultant image will be dumped into an OpenGL texture, via hooking and other voodoo, since Director has never had an SDK for its 3D engine released.

Yes, as f0dder already said, both DX and OpenGL can swap the B & R components, flip (horizontally and vertically), and perform gamma stuff. Everything is hardware accelerated, or MMX/SSE if emulated (at least on DX. I've never seen OpenGL emulating anything, but then I've never had a chance to check).
Posted on 2007-01-17 18:14:12 by ti_mo_n
After looking a bit at your code, it seems like the input format is RGBA and you want 0BGR? (or BGRA->0RGB, same deal really). For vs2005, try this on for size:

#include <intrin.h>

typedef unsigned int uint;
typedef unsigned char uchar;

void FlipVertical(uchar* tSrcImagePtr, uchar* tDstImagePtr, uint iWidth, uint iHeight)
{
    if (iWidth == 0 || iHeight == 0)
        return;

    // src and dst are dword pointers, so every offset below is counted
    // in pixels (dwords), not in bytes
    uint *dst = (uint *) tDstImagePtr;
    uint *src = (uint *) tSrcImagePtr;
    src = src + iWidth*(iHeight - 1);   // start of the last source line

    // Loop through each line
    for (uint i=0; i<iHeight; i++)
    {
        for (uint x=0; x<iWidth; x++)
        {
            // Convert one 32-bit pixel - eventually this will operate on 24-bit RGB values with no alpha
            // (dword values, high byte first)
            // RGBA - orig
            // ABGR - after BSWAP
            // 0BGR - after AND
            *dst++ = _byteswap_ulong(*src++) & 0x00FFFFFF;
        }

        // src now points one line past the line just copied;
        // step back two lines to reach the start of the previous one
        src = src - iWidth - iWidth;
    }
}


...pure 24bpp input would be more interesting (or annoying) :)
Posted on 2007-01-17 22:09:22 by f0dder
Okay, couldn't sleep (yay for insomnia), so I whipped up some code for 24bpp RGB -> 32bpp 0BGR. Pretty bad code, doesn't handle images that aren't a multiple of 4 pixels wide, etc etc. I just wanted to show how big a difference processing dwords rather than bytes makes, even with lame code. On my AMD64, I get ~1.6x speedup:

5000 iterations of f1: 2328 ticks
5000 iterations of f1: 2594 ticks
5000 iterations of nc1: 3781 ticks
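
The code that produced those timings isn't reproduced in this thread, so the following is only a sketch of the general technique, not f0dder's routine. It assumes the 24bpp source is stored as B, G, R bytes in memory (what "RGB" means if you read a pixel as a little-endian value 0xRRGGBB) and that each output dword should be 0x00BBGGRR. The names SwapLowAndHighByte and Convert24To32 are invented. Unlike the version described above, it adds a byte-wise tail for widths that aren't a multiple of 4.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: 0x00RRGGBB -> 0x00BBGGRR (swap the low and the third byte).
static inline std::uint32_t SwapLowAndHighByte(std::uint32_t p)
{
    return ((p & 0x000000FFu) << 16) | (p & 0x0000FF00u) | ((p >> 16) & 0x000000FFu);
}

// Hypothetical helper: convert packed 24bpp pixels to 32bpp dwords.
// Assumes a little-endian CPU (x86).
void Convert24To32(const std::uint8_t* src, std::uint32_t* dst, std::size_t pixelCount)
{
    std::size_t remaining = pixelCount;

    // Main loop: read 3 dwords (4 packed pixels), write 4 dwords.
    for (; remaining >= 4; remaining -= 4, src += 12, dst += 4)
    {
        std::uint32_t d0, d1, d2;
        std::memcpy(&d0, src,     4);
        std::memcpy(&d1, src + 4, 4);
        std::memcpy(&d2, src + 8, 4);

        // Reassemble each pixel as 0x00RRGGBB from the pieces spread over d0..d2.
        dst[0] = SwapLowAndHighByte(d0 & 0x00FFFFFFu);
        dst[1] = SwapLowAndHighByte((d0 >> 24) | ((d1 & 0x0000FFFFu) << 8));
        dst[2] = SwapLowAndHighByte((d1 >> 16) | ((d2 & 0x000000FFu) << 16));
        dst[3] = SwapLowAndHighByte(d2 >> 8);
    }

    // Tail: one pixel at a time, using byte reads.
    for (; remaining > 0; --remaining, src += 3, ++dst)
    {
        *dst = (static_cast<std::uint32_t>(src[0]) << 16)
             | (static_cast<std::uint32_t>(src[1]) << 8)
             |  static_cast<std::uint32_t>(src[2]);
    }
}
```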


Would be fun seeing what some of the skilled programmers around here could come up with - even without MMX or SSE, a dedicated assembly implementation should bring some nice improvement :)
Posted on 2007-01-17 23:28:13 by f0dder
Thanks for your efforts, f0dder. I'll have a look through the code when I get a chance.

At the moment I'm trying to improve the gamma (well, brightness really) function I might need to use, as for a 1280x960 image it takes a whopping 82ms, including the vertical flip and R/B swap  :shock:

I noticed a few aspects of the C++ code you provided that I'd like to understand the reasoning behind. If, as I suspect, they help the compiler, then I'll definitely incorporate them into my code.

1. *src and *srcRow - two pointers to the source
I can see why this was done: it allows src to iterate through the bytes of the pixels in a line, whilst srcRow allows for jumping through the lines of the image. What are the benefits of this method? I can see it removes the need to deduct 2*iRowBytes at the end of each line, but then you have to reset src to srcRow afterwards. Perhaps there is a small gain (I've not looked at the asm), but perhaps there are other, better reasons for doing this?

2. Temp variables
I noticed that inside the inner loop you declare the byte variables for holding r, g, b again. What are the benefits of doing this? I was under the impression that in a good C++ implementation these would get destroyed once outside the scope of the inner for loop, and would then have to be recreated for the next line.

3. Byteswap
It took a while to register what was going on with this function, until I looked properly at the little table comment you provided. Very nice. I'm going to have to take a good look through this code.

It didn't occur to me that you could byteswap with a shift to go from 0RGB to BGR0 - have to remember that. Although I don't think I'll need to do the byteswap, as I'm pretty sure Director image objects are BGRA (image objects in Director, the destination, don't support 24 bits; they are stored as 32 bits with a flag to indicate whether alpha is used or not, I guess for performance and memory layout reasons).

Now I'll just have to see if all the camera output modes support line widths that are divisible by 4.

Ok, I'm going to see if I can incorporate some of the code into my test functions. However, whilst I'm very grateful for the effort you've put into this, I must point out that this specific function was only for testing and getting to grips with asm. For the real project, other code considerations might make aspects of it less suitable. I'll post more details about that in a bit, but I didn't want you to get too carried away, investing your time in something that I may not use directly. Conversely though, it's all still great learning stuff for myself and hopefully anyone else who stumbles into this thread. So thanks again
Posted on 2007-01-18 08:23:53 by noisecrime
At the moment I'm trying to improve the gamma (well, brightness really) function I might need to use

Please know that integer image manipulation is best done using MMX, so you should go for it if you want some incredible speed improvements. As for brightness: the PADDUSB instruction is your friend here ;)
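
For anyone who wants to try PADDUSB without writing asm, here is a hedged sketch using compiler intrinsics. It uses the SSE2 form of the instruction (_mm_adds_epu8, 16 bytes at a time) rather than the 8-byte MMX form mentioned above, and falls back to plain C++ when SSE2 isn't enabled at compile time. The function name Brighten and the macro SKETCH_HAS_SSE2 are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 intrinsics
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: add 'amount' to every byte, clamping at 255 instead of wrapping.
// Note: this touches every byte, so for 32bpp data it would also brighten the alpha byte.
void Brighten(std::uint8_t* bytes, std::size_t count, std::uint8_t amount)
{
    std::size_t i = 0;

#if SKETCH_HAS_SSE2
    const __m128i add = _mm_set1_epi8(static_cast<char>(amount));
    for (; i + 16 <= count; i += 16)
    {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(bytes + i));
        v = _mm_adds_epu8(v, add);   // PADDUSB: unsigned saturating byte add
        _mm_storeu_si128(reinterpret_cast<__m128i*>(bytes + i), v);
    }
#endif

    // Scalar tail (or the whole buffer when SSE2 is unavailable).
    for (; i < count; ++i)
    {
        const unsigned sum = static_cast<unsigned>(bytes[i]) + amount;
        bytes[i] = static_cast<std::uint8_t>(sum > 255u ? 255u : sum);
    }
}
```

The saturation is the point: a plain byte add would wrap 250 + 100 around to 94, whereas the saturating add clamps it to 255.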
Posted on 2007-01-18 14:52:12 by ti_mo_n
Thanks for the information, ti_mo_n. I had no idea that MMX would be useful for integer manipulation; then again, having never looked at it, I guess I really don't know what MMX is for anyway.

PADDUSB - wow, what a great little instruction. Looks like I'm going to have to start reading up on MMX along with everything else I've got on my to-do list - or perhaps I should just finish the project and play later, I never expected to spend a week on it ;)

Somewhat frustrating to learn about the instruction now though, having spent all day optimising the algorithm and getting pretty decent results in the end.
Posted on 2007-01-18 18:42:30 by noisecrime
This topic describes MMX and SSE a bit.
Posted on 2007-01-18 20:32:32 by ti_mo_n
about compilers being good these days...

yesterday i looked at what vc2k5 would generate for a simple dst=src loop..

and it really blew my head off!

i don't have the exact code, but it had only ONE pointer increment!

there was something like

mov eax, [esi]
mov [esi+ebx], eax
add esi, 4


so i suppose the compiler has calculated the difference between the highest and the lowest address of source and dest pointer... and put it into ebx.

I FOUND THIS AMAZING!

well, let me explain; for someone else maybe it's very simple, and INDEED IT IS!
especially when you first had something like
add esi,4
add edi,4
but the reason it blew my head off is that it's something i've thought about quite a lot (well, not really, but you get the idea) and it never occurred to me! and now i think msvc is an outstanding piece of software.. (the ability to debug and see the assembly in a window etc. btw, once i couldn't debug properly: it told me the breakpoint would not be hit and i didn't understand why. i think maybe it was because the lib (tinyptc) was using LoadLibrary? anyway, that's not the point)

so.. the benefit is that you save the second ADD... but you still need TWO regs, even if they're not two pointers to src and dst... so you're not saving a reg.

but i thought:
you could save this reg by using an immediate displacement: mov [esi+imm32], eax
the reason you cannot do this is that src and dst are variables... not known at compile time...
so: with dynamic code generation, it would work! in fact you just have to patch the DWORD difference... would this work? (of course that's four bytes to embed in the instruction, but i hope it's not a big hit...)
then you've got one more reg available in your loop...

btw this leads us to another thing that impressed me:
i had coded image manipulation routines in C that did AHellOfALot of src+src[(i-1)i*3+2]
etc... accessing 24bpp pixels and neighbouring pixels, you know... and the compiler optimized the whole thing like i would have done, i mean NOT A MUL in the loop! it had figured out it could use a pointer and do ptr+=3 each iteration. that's great! (in fact i've seen this at school: it differentiates the expression you put in the loop, but i wouldn't have thought it could really get it right!)
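
To make that concrete, here is a small illustration (not the code from my routines; the function names are made up). The first function is what you write; the second is roughly the shape an optimizer can turn it into, with the multiply replaced by a pointer that advances 3 bytes per pixel. Both return the same result.

```cpp
#include <cstddef>
#include <cstdint>

// Indexed form: each access is src[i * 3 + 2], which naively needs a multiply per pixel.
std::uint32_t SumChannelIndexed(const std::uint8_t* src, std::size_t pixelCount)
{
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < pixelCount; ++i)
        sum += src[i * 3 + 2];
    return sum;
}

// Strength-reduced form: the i * 3 is replaced by a pointer bumped by 3 each iteration.
std::uint32_t SumChannelPointer(const std::uint8_t* src, std::size_t pixelCount)
{
    std::uint32_t sum = 0;
    const std::uint8_t* p = src;
    for (std::size_t i = 0; i < pixelCount; ++i)
    {
        sum += p[2];   // same byte as src[i * 3 + 2]
        p += 3;        // one 24bpp pixel forward, no multiply
    }
    return sum;
}
```

Since the compiler can usually do this transformation itself, the indexed form is fine to keep in source if it reads better.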

Posted on 2007-01-19 12:21:28 by HeLLoWorld

with dynamic code generation, it would work! in fact you just have to patch the DWORD difference... would this work? (of course that's four bytes to embed in the instruction, but i hope it's not a big hit...)

It would work, but SMC (self-modifying code) can end up slow - unless it's "generate once, use many". And you have to either make your code section writable (not necessarily a good idea), or generate the function in writable+executable memory.
Posted on 2007-01-19 16:36:37 by f0dder