Hi,

I've just started my first foray into learning ASM, mainly for the purposes of speeding up some image copying and manipulation code, but also because I've discovered it to be quite fascinating and a new challenge. I'm in the process of reading "The Art of Assembly Language" and trying out small bits of test code to get a better feel for ASM.

I'm starting off small, so using the MSVC ability to dump out the assembly with source code to a txt file I've tried to see if I can optimise a single line from a function I'm working on. Though not purely for optimisation's sake; also to learn how the same thing can be achieved by different means. As a result I have a few questions and would welcome any comments.

The line in question is this


*dst++ = (unsigned long)( (byte)(bRed) | ((byte)(bGreen) << 8) | ((byte)(bBlue) << 16));


Where dst is a pointer to the start of the destination image and bRed, bGreen, bBlue are bytes representing the components. I'm using an unsigned long as it means once it's been filled with the RGB components it's a single copy to get a whole pixel into the destination address. This in itself gave a 6% performance boost over doing it one byte at a time.

MSVC 6.0 compiles this line to the following asm


__asm{
mov edx, DWORD PTR bRed
and edx, 255 ; 000000ffH
mov eax, DWORD PTR bGreen
and eax, 255 ; 000000ffH
shl eax, 8
or edx, eax
mov ecx, DWORD PTR bBlue
and ecx, 255 ; 000000ffH
shl ecx, 16 ; 00000010H
or edx, ecx

mov eax, DWORD PTR dst
mov DWORD PTR [eax], edx
mov ecx, DWORD PTR dst
add ecx, 4
mov DWORD PTR dst, ecx
}


This looked rather excessive to me and I figured that it could be improved. My first attempts didn't get very far, but then I thought about using the fact that the 8-bit registers are shared with the 16- and 32-bit ones (inspired by the bit packing of day, month, year from AOA). So I came up with this


__asm{
mov dl, bBlue
shl dx, 8
or dl, bGreen
shl edx, 8
or dl, bRed

mov eax, DWORD PTR dst
mov DWORD PTR [eax], edx
mov ecx, DWORD PTR dst
add ecx, 4
mov DWORD PTR dst, ecx
}


In testing it was nearly twice as fast as the original MSVC code, which was nice, but I have a few questions.

1. Is this a legitimate use of the registers?
2. Is there anything to be wary of in doing this, especially between different CPUs, or should it work for all modern PCs?
3. Should registers be zeroed before using them?
I.e. is it good practice to add a mov edx, 0 at the start of the code?
4. Is there a better/more efficient method to achieve the conversion from the C++ code to asm?
5. Is there a better method of incrementing the dst pointer?
I think there probably is from reading AOA, but perhaps this can only be answered when looking at the whole function?
6. Why does MSVC appear to use all of the 32-bit a/b/c/d registers instead of re-using one or two?
7. Why does MSVC move each value into a register first, even when it apparently isn't needed?
8. Anything else I need to keep in mind, any ghastly coding-practice mistakes I've made?

Finally, on my system (Intel P4 2.6GHz single processor with hyper threading) this almost doubled the performance of the function, however testing on a friend's machine yields no difference. Sadly I don't know the specs of his machine but is there any reason for this? The C++ code took 23ms, the asm took only 12ms, whilst on his machine both versions took 16ms. I'm going to try some more tests, but I don't know of any reason why this should be.

thanks

Posted on 2007-01-15 17:59:41 by noisecrime
1. Yes. Compilers are generic when it comes to optimizations
2. You will be fine, as this code will work even on something as old as the 386 (the first x86 CPU with 32-bit registers)
3. I would recommend doing so by simply Exclusive OR'ing (read-up on its logical function) edx with itself to achieve zero (i.e. "xor edx,edx")... but you are loading a DWORD value anyhow so it would be useless
4. Anything that is automated, will be generic. Invest in the "best" compilers around, and tweak computational intensive sections by hand
5. Simply add the immediate value to the "dst" variable (i.e. "add DWORD PTR dst,4")
6. Answer to #1
7. Answer to #1
8. I wouldn't say ghastly, although I haven't examined your code all that hard while responding to these 8 questions. Dig into some x86 instruction set documentation (Intel's Docs would be a good start) and start reading. You are making good progress. Keep it up and keep your mind open ;)

Now, to tackle the code directly... though my MASM is a bit rusty...


__asm{
movzx eax, bBlue ;Moves bBlue into AL, and zeros-out the rest of EAX all in one operation
shl eax, 16 ;EAX = 00BB0000
mov ah, bGreen ;EAX = 00BBGG00
mov al, bRed ;EAX = 00BBGGRR

mov ebx, DWORD PTR dst
mov DWORD PTR [ebx], eax
add ebx, 4
mov DWORD PTR dst, ebx
}


A smarter way would be to align all 3 variables (bRed, bGreen and bBlue) into one DWORD value (sacrifice a null byte) so that you could load all values in one operation. Imagine this code example...


mov eax, DWORD PTR bRGB
mov ebx, DWORD PTR dst
mov DWORD PTR [ebx], eax
add ebx, 4
mov DWORD PTR dst, ebx


You can still access the individual values of bRGB by simply addressing their byte position...

bRed = BYTE PTR bRGB
bGreen = BYTE PTR bRGB+1
bBlue = BYTE PTR bRGB+2
BYTE PTR bRGB+3 always remains null

HtH ;)
Posted on 2007-01-15 18:51:15 by SpooK
Thanks for the prompt reply SpooK

1. Not sure I follow your reply to question one as I was asking if my use of the registers in my version of the code was acceptable practice. However looking at other parts of your reply I'm guessing they are fine. Just a bit worried when you go low level that something which works fine on one machine horribly breaks on another ;)

3. Ok, so if I weren't loading a DWORD it's generally a good idea to zero registers (I understand the xor method)

5. Cool that looks far better than the 4 lines/instructions that MSVC did ;)

6. I was wondering if there was any benefit in execution times with using different registers for each part of the code - it's almost like it cycles through them (depending on which register might be holding a value elsewhere in the code). But I'm guessing from your reply this is just done to make the job of writing a compiler to assembly easier.

8. Thanks for the encouragement.

Your Code - DOH!

Funny, I'd just been looking at the movzx opcode, but it looks like I was so fixated on the AOA date packing I'd completely missed the much more elegant solution you posted. This is exactly the sort of thing I was after, thank you. I really should have seen it though once I started using the fact that registers share lower bits. All those left shifts and or'ing in my code being completely superfluous.

Yes it would be better to align all the components into a single DWORD, however I have some restrictions/requirements for the eventual purpose of this code that make it impractical (I think): for one, I need to swap R and B from source to destination, and I'm also going to be adding a gamma ramp to the function eventually. But still I appreciate the code, it's all worthwhile to look at and learn from. Maybe from what you say I can do this even with the requirements I have, but I'll work my way up to that point.

No ideas on why the code didn't execute faster on my friend's machine?
I'm really puzzled to explain why that happens.

Thanks again
Posted on 2007-01-15 19:08:59 by noisecrime

Thanks for the prompt reply SpooK

1. Not sure I follow your reply to question one as I was asking if my use of the registers in my version of the code was acceptable practice. However looking at other parts of your reply I'm guessing they are fine. Just a bit worried when you go low level that something which works fine on one machine horribly breaks on another ;)



There are general practices when working with an Intel x86-based ABI. Most notable is the preservation of EBX/ESI/EDI/ESP/EBP across calls. EAX/ECX/EDX are "expected" to be trashed and most functions don't depend on their values after a call, with the exception of EAX being a common return value holder.
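
For example, a routine that uses any of the callee-saved registers would typically be wrapped like this (just a bare-bones sketch, the routine name is made up):

MyRoutine:
push ebx            ; callee-saved: preserve before use
push esi
push edi

; ... work that is free to trash EAX/ECX/EDX and to use EBX/ESI/EDI ...

pop edi             ; restore in reverse order
pop esi
pop ebx
ret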



3. Ok, so if I weren't loading a DWORD it's generally a good idea to zero registers (I understand the xor method)



Whichever method makes you more comfortable in giving registers "known" values... as it will save you much time and many headaches in trying to trace down such small bugs due to such assumptions.



5. Cool that looks far better than the 4 lines/instructions that MSVC did ;)



You would also have to look at the surrounding code context to see if that particular code arrangement is for more reason than just that function. In a compiler, this would be doubtful due to generic optimizations... but it is a good habit to get into when optimizing.



6. I was wondering if there was any benefit in execution times with using different registers for each part of the code - it's almost like it cycles through them (depending on which register might be holding a value elsewhere in the code). But I'm guessing from your reply this is just done to make the job of writing a compiler to assembly easier.



Speed-wise, not really. Size-wise, EAX has some special opcodes for certain instructions that cause the instruction encoding to be up to a few bytes smaller. Once again, though, not much difference. More concern should be placed upon algorithmic optimizations and not so much on the instructions themselves.
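
For example (hand-assembled from memory, so double-check the exact bytes against the Intel docs):

add eax, 12345678h   ; encoded as 05 78 56 34 12     - 5 bytes (EAX short form)
add ecx, 12345678h   ; encoded as 81 C1 78 56 34 12  - 6 bytes (generic form)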

As for the compiler, that is almost exactly what it needs to do. It is a smart thing to do, as you don't want to re-load a register if you haven't exhausted your supply of "free" registers.



8. Thanks for the encouragement.

Your Code - DOH!

Funny, I'd just been looking at the movzx opcode, but it looks like I was so fixated on the AOA date packing I'd completely missed the much more elegant solution you posted. This is exactly the sort of thing I was after, thank you. I really should have seen it though once I started using the fact that registers share lower bits. All those left shifts and or'ing in my code being completely superfluous.



I did the same thing with XOR. I knew the function of an Exclusive OR from electronics, but it never occurred to me that I could use XOR to zero-out a register on the x86 until someone informed me of it. I learned and adapted.



Yes it would be better to align all the components into a single DWORD, however I have some restrictions/requirements for the eventual purpose of this code that make it impractical (I think): for one, I need to swap R and B from source to destination



Read up on the "xchg" instruction, it might just be what you need.


No ideas on why the code didn't execute faster on my friend's machine?
I'm really puzzled to explain why that happens.


Even the oldest 386 computers executed more than one million instructions per second (ideally) so it stands to reason that reducing 20 lines of code to 5 is not going to make any noticeable impact unless you reiterate the operation billions, probably even trillions, of times. This type of optimization becomes more apparent when you work with larger data structures (e.g. working around cache misses and the like.)
Posted on 2007-01-15 20:48:22 by SpooK
Welcome to the world of asm.

The beauty of asm is the variety of ways to achieve a goal. Here's another way of loading a 32-bit register with an RGB value without any shifts. The reason I am suggesting this variation is because shifts can be slow on some CPUs.

xor edx,edx
mov dh,bBlue
bswap edx  ;same effect as shl edx,8 here, since only DH holds data
mov dh,bGreen
mov dl,bRed


Raymond
Posted on 2007-01-15 22:31:21 by Raymond
bah, couldn't sleep so got back up to continue playing ;)

Thanks for the extra info, very informative. I feel bad adding more questions, because quite frankly I'll probably never stop, so don't feel that you have to reply, I'm sure others are itching to jump in and help ;)

xchg - I looked this up online (btw is there a good standard reference for the opcodes?)
A useful function, but I can't see how it would help (directly) with swapping R and B. If I needed to swap G and B then I could see it being very useful, but maybe I'm reading the description wrong?

As to performance, I'm running this on a 1024x768 sample image, so the loop is being executed 786,432 times, thus I do 'expect' some increase ;) Although using your code examples the difference was slight, but then again I'm kinda at the limit of accurately detecting timings at the ms level (went from 12ms to 10/11ms). Doesn't matter, the code you posted is far more elegant and I much prefer it.

In fact spurred on/inspired by the code you posted I converted some earlier code in the function


bRed   = *src++;
bGreen = *src++;
bBlue  = *src++;
*src++; // skip alpha


to


mov ecx, DWORD PTR src  ;Extract source address
mov edx, DWORD PTR [ecx] ;extract long value

mov bRed, dl
mov bGreen, dh
shr edx, 16
mov bBlue, dl

add DWORD PTR src, 4 ;increment pointer


Which appears to work and has cut down the time of the function to just 7ms (original C++ was 23ms). Which is absolutely fantastic. Of course the caveat being that it's a special case, being within a tight loop executed hundreds of thousands of times, so any small benefit becomes considerable in the end.

I did wonder if I could reduce the two lines for getting the long data from the source pointer, but didn't have any luck. I'm guessing this is the most efficient means of getting a value from a pointer?

Sadly for my real function I'm not sure I can use this new bit of code as the source image is going to be RGB not RGBA (the destination will still be RGBA). Up until the last pixel I guess I could simply mask or ignore the byte from the next pixel, but when I hit the last pixel I'm going to be grabbing a byte of data out of bounds, and I have no idea if that will give a nasty surprise.

Anyway, thanks again for all your help, it's been very inspirational.
Posted on 2007-01-15 22:52:24 by noisecrime
Thanks Raymond ... I think - more stuff to consider ;)

Very interesting to learn that shl/shr can be slower; indeed from looking at a page I found about the opcodes (not a great page though) it looks like shl can take 3 cycles, whilst bswap only takes 1.

Definitely something for me to investigate further sometime.
Posted on 2007-01-15 22:57:22 by noisecrime

Thanks Raymond ... I think - more stuff to consider ;)

Very interesting to learn that shl/shr can be slower; indeed from looking at a page I found about the opcodes (not a great page though) it looks like shl can take 3 cycles, whilst bswap only takes 1.

Definitely something for me to investigate further sometime.



Just remember, however, that BSWAP is only available on the 486 and newer processors. I don't see this as a concern unless you are specifically expecting the program to be run on a 386 machine.

You can pretty much rely on the base instruction-set provided with the original Pentium/586 as the "minimum" for your programs, without much concern.
Posted on 2007-01-16 01:19:22 by SpooK
Raymond's suggestion nicely fits the Pentium 4's requirements: P4 has slower shifts, but doesn't suffer from partial register stalls. Additionally, it can quickly remove a register from a dependency chain if it has been zeroed by xor-ing it. This code, however, is supposed to be much slower on PII/PIII (although I haven't performed any benchmarks, so it's just a theoretical thought).

Of course we can make it even faster on P4 by using MMX ^^
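
Something along these lines, just to show the idea (a rough, untested inline-asm sketch; src, dst and pixelCount are placeholder variables, it assumes 32bpp data with an even pixel count, and it only does a straight copy rather than the actual channel shuffling):

__asm{
mov esi, DWORD PTR src
mov edi, DWORD PTR dst
mov ecx, DWORD PTR pixelCount
shr ecx, 1              ; two 32bpp pixels per iteration
copy_loop:
movq mm0, [esi]         ; load two pixels (64 bits) in one go
movq [edi], mm0         ; store them
add esi, 8
add edi, 8
dec ecx
jnz copy_loop
emms                    ; clear the MMX state before any FPU code runs
}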
Posted on 2007-01-16 02:14:50 by ti_mo_n
Thanks for the additional info guys,

Is there a website or document I can get which gives a pretty good outline of what instructions are supported on which Intel chips and the relative performance of the different opcodes?

I very much doubt I need to target anything less than a P4, but I would like the code not to crash and burn on a P3 at least; I think a P2 or lower just wouldn't be used in the first place. Also how are things going the other way with the Pentium D and newer chips? I've read they use a subset of x86 so is there any danger of the instructions not working on them? Should I look at providing a 64-bit version of the instructions?

I really like the performance benefits I'm getting from this code, but don't want to be restrictive on which machines can run it. I assume there is some method of detecting the chip being run on and switching to different functions using different instructions. Any links you can provide to help with this?

For example I'd be quite happy to check if the chip is P2 or less and throw up an error that it can't be run on such machines, but how to detect this? Not to mention what to do for AMD chips?

.. so much to learn ;)

I'm still thinking about how I can read RGB (24 bits) from the source image in an efficient manner - see my 3rd post. Do I need to be concerned about reading a byte past the end of the image data?


thanks
Posted on 2007-01-16 07:25:05 by noisecrime
MSVC6? Upgrade, upgrade, upgrade! Also, which command line options? Optimizations even enabled?

The 64bit x86 CPUs have a superset, not subset, of the 32bit CPUs. Even when running a 64bit OS, they can still run 32bit code, so focus on 32bit for now. MMX has been available since pentium1-mmx, so it's pretty safe to use.

Do consider switching your image data to 32bit ARGB/RGBA instead. But if you stick with 24bit, no, you generally don't have to worry about reading a byte past the data end. You need to cross a page (4k) boundary for this to ever be a problem, and additionally the page you cross into needs to be non-present. There are situations where this can happen, but it's extremely unlikely it'd ever happen in a situation like this.

Perhaps you could post a bit more of your code, and tell a bit more about it. Chances are much better results can be achieved if more optimization is done rather than focusing on the innermost of the inner loop.
Posted on 2007-01-16 09:23:24 by f0dder
Of course that's absolutely true, I just thought I'd throw this quote into the mix...


Ten minutes optimizing an algorithm is worth more than ten weeks optimizing an implementation

Posted on 2007-01-16 09:47:59 by Homer

MSVC6? Upgrade, upgrade, upgrade! Also, which command line options? Optimizations even enabled?


Well MSVC6, because it was my first compiler and for ?90 it's been the best investment I've ever made in software terms. Of course it's the most crippled version but it got me into writing C++. I do have the VC2005 Express edition and gradually I'm migrating over to it, but it causes issues with the Xtra XDK and other libraries I'm using so it's a slow process.

Command line options - what are those ;) I'm afraid I'm not a hardcore programmer, I like my GUI, but as far as I'm aware yes I do have optimise for speed in the settings. However I don't think this really has any bearing on this thread as although I'm using asm to make faster or rather more efficient code than MSVC produces, I'm also doing it just to get to grips with and learn asm in general. Frankly if I hadn't found that MSVC 6 or 2005 could output assembly with source I wouldn't even be here now. It gave me the opportunity to study working asm and I've definitely made more progress than I did back in 1985 when I was trying to learn assembly/machine code for the Z80 (ZX Spectrum) ;)



The 64bit x86 CPUs have a superset, not subset, of the 32bit CPUs.


subset - typo on my part, sorry


Do consider switching your image data to 32bit ARGB/RGBA instead. But if you stick with 24bit, no, you generally don't have to worry about reading a byte past the data end. You need to cross a page (4k) boundary for this to ever be a problem, and additionally the page you cross into needs to be non-present. There are situations where this can happen, but it's extremely unlikely it'd ever happen in a situation like this.


Unfortunately I can't switch from 24 bit as this is what I get from the source (see below for more details). Also although you appear to be saying it's ok to read past the end of the data, you also mention it's 'possible' for something nasty to happen. This doesn't seem like a 'sound' design as normally if something can go wrong it will ;). I'm still not fully up on where this data is stored, but I was worried in case it hit a reserved/protected area and caused a 'protection fault'? - showing my ignorance of how things work at low level I'm afraid.


Perhaps you could post a bit more of your code, and tell a bit more about it. Chances are much better results can be achieved if more optimization is done rather than focusing on the innermost of the inner loop.


Not sure code would be useful. For starters I'm working on testbed code, that is a small function that is representational of the final project but not actually part of it. This is to give a smaller, more manageable amount of code to convert to asm and provide an easy, quick mechanism for testing each change as I make it. Basically I'm passing in an image object (32-bit), flipping it vertically and putting the result into another image object, also 32-bit.

I agree that in general algorithm optimisations can be far more productive, but in this case there are literally only 3 or 4 lines of code I've not posted, as the function is as simple as a loop, grabbing source pixels, writing destination pixels. The next stage will probably involve looking at my gamma code, but again that's fairly minimal already, yet will be more complex to convert to asm with my limited knowledge.

The end game for all this 'exploration' is a project I've already written, but want to make faster. It's an Xtra (essentially a DLL) for Adobe/Macromedia Director that passes realtime firewire camera images from a closed proprietary library into Director's native image object. The cameras themselves don't have built-in gamma so I need the option to apply a gamma ramp, the camera images are RGB, but Director is BGRA, and I need options to flip the source image either/both vertically and horizontally.

I was disappointed with my initial code's performance, so I tried first optimising the C++ code (i.e. using a long to hold all the destination pixel components); there really isn't anything that can be done at the algorithm level since it's all very basic stuff anyway, but clearly writing improved asm can and does have an impact within the tight loop. However instead of trying to rewrite the whole thing in asm I've been looking at optimising specific lines or areas of code, which is where this thread started.

I'm not looking for the very best implementation, just one that gives a decent increase in performance without years of effort, which I've achieved with the help of people from this forum. Gradually I'm building the code back up based around these optimisations, at which point I may need to refactor the whole thing as better avenues become apparent. But that's the learning process.

Then again I'm also just enjoying working with a new language, testing the different approaches given in this thread, learning a little bit at a time. Maybe once I understand this to a better degree I'll post more complete code, but it's likely to be quite lengthy which normally means it gets fewer responses.

Oh, one thing I would appreciate is any pointers (pardon the pun) to efficient methods of flipping an image vertically or horizontally, as I've had little luck in finding anything (it's all 3D stuff these days online). I've written the dual flip (that's easy, just iterate backwards through the source) and the vertical flip (but that uses two loops, the outside loop controls which line, the inside controls pixels in the line), but it's one of those areas that has been worked on for years by many people (until 3D came along) so I'm sure there are some good standards I could learn from.

thanks for your help
Posted on 2007-01-16 10:38:28 by noisecrime

Well MSVC6, because it was my first compiler and for ?90 it's been the best investment I've ever made in software terms.

It's not bad at all either, it's just pretty old & dated and doesn't generate super-good code - and is crippled if you get into "recent C++". But it's still decent enough, and there's still parts of the VS6 IDE I prefer to the VS2005 one.


However I don't think this really has any bearing on this thread as although I'm using asm to make faster or rather more efficient code than MSVC produces, I'm also doing it just to get to grips with and learn asm in general.

Point taken, and that's not a bad thing to do :). Still, if you're going to continue also using C/C++, do yourself the favour and move to a more recent compiler (it's neat that the VC2005 express edition isn't crippled, but actually has pretty much the full optimizing engine).


I'm still not fully up on where this data is stored, but I was worried in case it hit a reserved/protected area and caused a 'protection fault'? - showing my ignorance of how things work at low level I'm afraid.

Well, memory allocation is generally done at some granularity. It should be safe to assume that this granularity will be at least four, so reading a byte too much shouldn't pose a problem in this situation. If you move to MMX/SSE code, it's best to process only full chunks, and have non-fancy code for the possibly few remaining pixels though.

Also, reading too much is only a problem if you cross a page boundary. Never write too much, though :)

There's probably some hardware-accelerated way to do your flipping with DirectX, which you should look into if you really want maximal speed, but let's focus on optimizing the algorithm instead - that's a bit more fun.

You'll want to try to avoid reading bytes - it's better to read full DWORDs (or larger quantities if using MMX/SSE), manipulate these as necessary to account for 24bpp, and write out full DWORDs again. 24bpp is such an annoying format though, since it's not evenly divisible by 32, 64 or 128 bits :)
Posted on 2007-01-16 16:06:37 by f0dder
Well I've started to port the project to VC2005; actually at the moment it's just the skeleton project for creating Director Xtras, as I've been meaning to sort it out for some time. Odd thing though, the exact same skeleton code compiles to a release version that is almost double the size of the VC6 version? I spent a good hour going through all the properties to see if there was any obvious reason for this, but found nothing. The settings should be the same since the VC2005 project is an import of the VC6 project with a few fixes for deprecated functions and the like.

It seemed worth doing as I've heard many times that VC6 (at least the version I have) wasn't the best of compilers and VC2005 was meant to be an improvement. I guess we'll see once I start looking at the assembly output, another reason why it's worth doing.


Also, reading too much is only a problem if you cross a page boundary. Never write too much, though


Unfortunately I'm not really aware of page boundaries and the like; it's knowledge that I've never had to be concerned with directly, which is also why it's somewhat of a concern. I'm sure I'll get to grips with it at some stage.

I did consider perhaps reading 4 pixels' worth (at 24 bits) at a time into 3 (32-bit) registers - not sure if that's really possible - to avoid the issue of reading past the end of the data, but that adds a whole other level of complexity ;)

Anyway I've got a new VC2005 project ready to start adding several test functions so I can profile converting lines of C++ to asm and different methods. So I'm going to start playing with that. No doubt I'll have a whole host of new questions in the morning ;)
Posted on 2007-01-16 17:54:11 by noisecrime

Odd thing though, the exact same skeleton code compiles to a release version that is almost double the size of the VC6 version?

The VC2005 runtime is a bit larger than the VC6 one - for instance they removed the single-thread version, so there's only the multi-thread safe version, which is a bit larger. Chances are that your VC6 version also linked against msvcrt.dll instead of static linking?


I did consider perhaps reading 4 pixels' worth (at 24 bits) at a time into 3 (32-bit) registers - not sure if that's really possible - to avoid the issue of reading past the end of the data, but that adds a whole other level of complexity ;)

It's possible, of course you'll need to do some shifting around and such, but my guess is that it'll be worth it.
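
Roughly like this (an untested sketch with placeholder src/dst variables; it expands four 24bpp RGB pixels, read as three DWORDs, into four 32bpp values packed the same way as your original *dst++ line, ignoring the R/B swap and gamma):

__asm{
mov esi, DWORD PTR src
mov edi, DWORD PTR dst

mov eax, [esi]          ; EAX = R1 B0 G0 R0
mov ebx, [esi+4]        ; EBX = G2 R2 B1 G1
mov ecx, [esi+8]        ; ECX = B3 G3 R3 B2

; pixel 0 = 00 B0 G0 R0
mov edx, eax
and edx, 00FFFFFFh
mov [edi], edx

; pixel 1 = 00 B1 G1 R1
shr eax, 24             ; EAX = 00 00 00 R1
mov edx, ebx
and edx, 0000FFFFh      ; EDX = 00 00 B1 G1
shl edx, 8              ; EDX = 00 B1 G1 00
or  edx, eax
mov [edi+4], edx

; pixel 2 = 00 B2 G2 R2
shr ebx, 16             ; EBX = 00 00 G2 R2
mov edx, ecx
and edx, 000000FFh      ; EDX = 00 00 00 B2
shl edx, 16             ; EDX = 00 B2 00 00
or  edx, ebx
mov [edi+8], edx

; pixel 3 = 00 B3 G3 R3
shr ecx, 8
mov [edi+12], ecx

add esi, 12             ; four source pixels consumed
add edi, 16             ; four destination pixels written
}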
Posted on 2007-01-16 19:18:24 by f0dder
Ok, whose bright idea was it to switch to using VC2005?

Just added my first test function, the basic C++ version of the code I'd been using from the start. Compiled and tested it to discover it took 3.5ms! That's over 6 times faster than the same code compiled in VC6 and still twice as fast as my asm version ;(

Of course I'm happy with the performance increase and extremely surprised the compiler was able to optimise it so well. Unfortunately I've found the asm it generates now is much harder to follow, but I guess that's the price for optimised code. I'm not sure it's even worth trying to improve on it ;)

Guess I'm sold on using VC2005, though it's still tempting to try and write the function myself in asm and see how close I can get in terms of performance.
Posted on 2007-01-16 20:32:27 by noisecrime

Guess I'm sold on using VC2005, though it's still tempting to try and write the function myself in asm and see how close I can get in terms of performance.


Trying to beat such a compiler using selective in-line assembly language would probably be a futile task unless the compiler is "horrible" at optimizing said task/function. Remember what I said about the surrounding code context.

However, it is indeed possible for you to redesign your program in assembly language to achieve smaller and faster code execution. The thing you have to ask yourself in these situations is... well... is 10 times the work in rewriting said program to assembly language worth such an insignificant improvement??? I think the obvious answer is no. As you gain more programming knowledge/experience, you will learn when and where to pick such "battles" ;)

In general, with compilers like VC2005, in-line asm is probably obsolete for anything less than the needed low-level instructions like those used in OS Development.
Posted on 2007-01-16 23:44:37 by SpooK
Well, even with decent compilers like VC2005, the Intel C++ compiler and GCC 4.x, you can almost always still beat the compiler. Sometimes it won't be a massive improvement and will be pretty useless except for learning value, other times (especially in the domain of graphics and sound, and especially if you move to MMX/SSE) you can still get some really significant savings.

Even with some algo restructuring (without assembly) you can probably boost it even more, and I'm still convinced you could really get somewhere by "thinking in assembly" - so don't give up yet, but enjoy the ride :)
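
As a taste of what I mean (an untested sketch, assuming a 32bpp RGBA source pixel with the source pointer in ESI and the destination pointer in EDI), the R and B swap can be done on a whole pixel with two masks and a rotate instead of byte by byte:

mov eax, [esi]          ; EAX = AA BB GG RR (one source pixel)
mov edx, eax
and eax, 0FF00FF00h     ; keep alpha and green where they are
and edx, 000FF00FFh     ; isolate red and blue
rol edx, 16             ; red and blue trade places
or  eax, edx            ; EAX = AA RR GG BB
mov [edi], eax          ; i.e. B,G,R,A byte order in memory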


In general, with compilers like VC2005, in-line asm is probably obsolete for anything less than the needed low-level instructions like those used in OS Development.

I'd have to agree with that - you usually get the biggest improvements by re-writing an entire function and shoving it off to external assembly. That lets you use the same C++ and Assembly code even though you switch to another compiler; use FASM, YASM, or NASM for the external assembly since those assemblers run on a variety of operating systems.
Posted on 2007-01-17 02:49:36 by f0dder
Thanks for the comments guys.

I have to admit I've gone from mild euphoria to being rather despondent at this discovery, and that VC2005 is now the best free software I've ever got; I will never use MSVC 6 ever again ;)

I have to agree with all the points made in the last two posts. I'm really amazed at how well the latest compilers work and I've not even begun to explore some of the new compiler options for speed yet.

From what I can see it looks like the VC6 compiler I had optimised almost exclusively on a line-by-line basis, whilst VC2005 can examine several lines at a time, the function in its entirety and even the whole program. So VC6's output was easy for a novice such as myself to read and understand, but VC2005's is going to be far more of a struggle.

Mind you, in doing some more tests - adding the R and B swap in the C++ code - VC2005 isn't so hot, going up to 5ms, whilst the code developed through this thread goes down to 5.6ms. So there might still be hope of writing an entire asm function that improves on the compiler.

However it's somewhat of a moot point as the function I've been testing on isn't actually going to be used, it was just a simple test. Several times I was asked about the function in case the algorithm or the actual code could be improved. I'm not entirely sure it can; there are some obvious changes that could be made, but as such I can't see them making a huge difference, but who knows, perhaps I'll learn something new. So for the sake of completeness I'll post the function, but we are rapidly moving away from talking about asm.


MoaError TStdXtra_IMoaMmXScript::ncp_FlipVertical_BasicCPP(byte* tSrcImagePtr, byte* tDstImagePtr, MoaLong iWidth, MoaLong iHeight)
{
    MoaLong i, x;
    MoaLong iRowBytes   = iWidth*4;
    MoaLong iImageBytes = iWidth*iHeight*4;
    byte bRed, bGreen, bBlue;
    unsigned char*  src;
    unsigned long*  dst;

    dst = (unsigned long *)tDstImagePtr;
    src = (unsigned char *)tSrcImagePtr;
    src = src + iImageBytes - iRowBytes;

    // Loop through each line
    for (i=0; i<iHeight; i++)
    {
        for (x=0; x<iWidth; x++)
        {
            // Extract the current RGB values - eventually this will be 24-bit RGB values with no alpha
            bRed   = *src++;
            bGreen = *src++;
            bBlue  = *src++;
            *src++; // skip alpha

            // Write to dst
            *dst++ = (unsigned long)( (byte)(bBlue) | ((byte)(bGreen) << 8) | ((byte)(bRed) << 16));
        }

        // Move src back two lines: one to undo the inner loop's advance, one to step up to the previous line
        src = src - iRowBytes - iRowBytes;
    }

    return kMoaErr_NoErr;
}


Anyway I think for now I'll start writing up the proper functions and seeing what the results are from that. I'm pretty sure the design and structure of the code will have an effect on its performance, but it's no longer asm so I don't think I could continue the discussion about it on this forum (as much as I'd like to). I might start a new thread if there are parts of the asm VC2005 generates which I think could be improved though.

Anyway, thanks again to everyone who replied, it's been most informative and hopefully I'll find some things to do in asm in the future as it's quite fun.
Posted on 2007-01-17 09:05:50 by noisecrime