Hi Board!

I am new to ASM and here is my first stupid post. :-)
I know for sure that no programming language is faster than ASM - provided the code is written well. But what about, for example, a normal loop like

(in VB)
for i = 1 to 100000000
    j = j + i
next

Why is a VB or C++ compiler not able to translate such a standard loop into the same machine code a hand-written ASM version would use? Or, the other way round: what "extra" code does a compiled VB loop send to the CPU?


Best regards from Germany,
Marc
Posted on 2004-06-15 22:09:02 by marc waesche
For a lot of algorithms and such, a good modern optimizing compiler can produce very good assembly language (they know a lot of tricks these days, sometimes producing faster assembly than I would naturally). However, they often don't optimize things as well as you'd hope. Why is assembler faster? The main reason is that a human is more intelligent and (if they know what they are doing - and I'm just a beginner on modern processors in this) can optimize cleverly. Also, a human can go over the inner loop again and again, trying to make it go a little bit faster by doing it a little differently every time.

Also, in assembler a human can use MMX and other such technologies, whereas most compilers don't use them, or can't use them very well.
A compiler only knows the code; it can optimize that according to limited (though quite amazing) techniques these days.
A human also knows the algorithm, and knows which parts of the code are important to optimize and which don't matter.
Posted on 2004-06-15 22:14:01 by klumsy
The first thing to realize is that assembly (in itself) isn't faster - but that you can (often) get stuff faster by writing it yourself, in assembly.

Why? HLLs operate from a set of rules, heuristics, call it what you will. These are generic rules that have been designed to make most stuff work well. And indeed, most of the time modern compilers generate pretty good machine code. However, if you really know your problem and your target CPUs, there are specific situations where you can beat the compiler - because your case is so specific that it doesn't fit the compiler's generic rules.

This is especially true if your algorithm can be vectorized to use MMX, SSE/2/3, 3DNow, or whatever SIMD instruction set your target CPU supports. These are all relatively new instruction sets, so even the compilers that can automatically optimize for these can be pretty easily beaten by humans (so, for the time being, /arch:{mmx,sse,sse2} or whatever switches shouldn't be viewed as much more than a quick-and-dirty way to, perhaps, squeeze out a little extra performance, until you get the time to hand-optimize).
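To make this concrete, here is a minimal sketch (my own illustration, in C) of the kind of loop SIMD is good at - four of these float additions fit into a single SSE addps, whether a compiler figures that out by itself or a human writes it by hand:

/* Illustration only: a loop that maps naturally onto SSE. Whether a
   given compiler actually vectorizes it is exactly the open question. */
void add_arrays(float *dst, const float *a, const float *b, int n)
{
    int i;
    for (i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}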

Anyway, back to my first claim: assembly doesn't make things faster by itself - it requires thinking. If you code simplistically, using MASM-style if/else, simple coding style like the iczelion tutorials etc., you're going to be beaten by most modern compilers. If you want good (executable) code, you must write good (source) code. This is true for HLLs too btw., it _does_ matter how you construct your for loops, how you access your arrays, how you represent your data.
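As a rough illustration of the array-access point (my own example in C, assuming the usual row-major layout):

/* Both functions do the same work on the same data. The first walks
   memory sequentially and is cache-friendly; the second strides N
   doubles ahead on every access and tends to be much slower - same
   source language, very different machine behaviour. */
#define N 1024
static double m[N][N];

double sum_row_major(void)
{
    int i, j;
    double s = 0.0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

double sum_column_major(void)
{
    int i, j;
    double s = 0.0;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            s += m[i][j];
    return s;
}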

PS: the VB example you gave is a bit funny... most intelligent compilers would remove those lines completely, since you never use the end result (j). Even if you did use it, some of the good compilers might be able to replace the whole computation with a constant.
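To illustrate (a sketch in C rather than VB, since it's easier to show): the loop has a closed-form answer, so a sufficiently clever compiler - or a human - can skip the loop entirely. Note that the real sum doesn't even fit in 32 bits, which a human spots immediately.

/* What the VB example computes, and what it can be folded into.
   Sketch only; whether a particular compiler performs this reduction varies. */
unsigned long long sum_loop(void)
{
    unsigned long long j = 0, i;
    for (i = 1; i <= 100000000ULL; i++)
        j += i;
    return j;
}

unsigned long long sum_closed_form(void)
{
    unsigned long long n = 100000000ULL;
    return n * (n + 1) / 2;    /* 5000000050000000 */
}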
Posted on 2004-06-15 22:47:22 by f0dder
In short, a human applies logic and intelligence. A compiler applies a set of rules designed by humans. Through logic and intelligence (experience, insight, knowledge, etc) a human can derive new rules. A compiler can not.

More abstract: writing optimal code is a very hard problem (non-polynomial complexity). The only way to find the optimal code is to try out all possible variations of instructions and sequences, and pick the best one. Since that will require almost endless compiling time, an approximation of the optimal code is used instead. This is driven by heuristics. These heuristics may work well in most cases, but they are never guaranteed to find the optimal code. Neither is the human approach, but it does get better results in some cases, because it approaches each case separately, and in a more advanced way.
Posted on 2004-06-16 02:22:31 by Scali
Thanks to klumsy, f0dder and Scali! That was some interesting information for me.
Scali, how does that work with the heuristics? Is there some kind of tutorial available for that?

Best regards,
Marc
Posted on 2004-06-16 09:26:27 by marc waesche
marc, it's a _very_ complicated topic, so you won't find any "tutorials", I'm afraid. There's bound to be a lot of complicated research papers on the matter, though... And probably a bunch of proprietary information as well :)

Perhaps you can find some info (or at least words you can search for) by looking at the GNU site for the GCC compiler - it's probably hopeless trying to read the source though, and even more hopeless trying to understand it, if you don't have compiler experience already.

If you want to look a bit at some extremely simple compiler, google for crenshaw+compiler :)
Posted on 2004-06-16 09:49:20 by f0dder
Assembly is not faster, and neither is C or C++ or any other compiled language (I am excluding JITs here). Speed depends solely on the CPU architecture and how well the compiler exploits it; a language in itself cannot be faster. Some optimizing compilers will do a very good job of making full use of the tricks and rules to get every advantage, but in the end you are limited by the quality of the compiler and how many "special cases" were allowed for when writing it. With assembler there is no such limitation: you control the optimizations yourself, and therefore a very good assembly program can outperform an excellent compiler; conversely, an excellent compiler will usually outperform a fairly good assembly program. Many modern compiler systems allow you a certain level of control over the types and aggressiveness of optimization, which allows them to further reduce runtime and generate more efficient code. If you are using assembly solely for speed, you might want to think about inline assembly for the critical parts of your code; if you are using assembly because you like the simplicity and structure, then full assembly applications are for you.
Posted on 2004-06-16 14:01:50 by donkey
donkey, a common misconception.
As f0dder stated, a high-level language defines rules that can affect how and what kind of code is generated. For example, BASIC has array bounds checking, whereas C doesn't. This means that an array access will take more time in BASIC than in C (assuming both implementations are optimal), if the index being accessed is not known until the access itself.
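As a hedged sketch (written in C just to show the shape of the generated code, not actual BASIC output): the checked access has to pay for a compare and branch on every element the compiler can't prove safe.

#include <stdlib.h>

/* What C permits: a bare load, no questions asked. */
int get_unchecked(const int *a, int i)
{
    return a[i];
}

/* What a bounds-checking language effectively requires when the index
   isn't known until run time (abort() is just a placeholder for the
   runtime error a BASIC implementation would raise). */
int get_checked(const int *a, int len, int i)
{
    if (i < 0 || i >= len)
        abort();
    return a[i];
}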
Posted on 2004-06-16 14:11:30 by death
Hi Death (I have always wanted to say that)

I believe that would come under "the CPU architecture and how well the compiler exploits it". You could just as easily write a BASIC implementation that did not do bounds checks. A language is a syntactical representation of instructions; it cannot be faster. The only thing that can make a program faster is how the compiler interprets it and translates it into machine-readable code, nothing else. BASIC is a language, and there are many compilers that produce vastly different code for the same source, so it is obviously how well the compiler does its job that makes the difference here, not the language.
Posted on 2004-06-16 14:16:10 by donkey
Hello donkey,
Not really. Once you drop out the bounds checking, it's no longer BASIC. By definition, BASIC has array bounds checking. The rules are part of the language, not part of the compiler/architecture specification.
Posted on 2004-06-16 14:21:15 by death
I would have to see who actually decided that it was required; I have looked at some vastly different versions of BASIC. Would you consider a version without OOP models or ActiveX a different language? I did a Google search for requirements for writing a BASIC compiler and could find no reference to bounds checking at all - actually, I couldn't find any formalized standard at all, so I am undecided on the issue. However, it sounds more like something you expect from the compiler than a formal requirement.
Posted on 2004-06-16 14:29:49 by donkey
Wow, all that makes Assembly sound like a very interesting thing to me! :-) But I am afraid the starting point to "get into the stuff" is not really easy. I read a book, but in contrast to a VB book it is not easy to start writing code right after reading it. I could change the described code examples a little bit, but... just a little bit. To start writing completely different code than what is explained, it seems necessary to read some more books. :-) I prefer learning by doing, but how? I understand all of what you kindly replied to my question, but it is kind of abstract.
For example: I want to write code that contains a loop that counts from 1 to 10000000, and every number should be written to a text file followed by a newline. What would be a good way to do this on a P4 2.53 GHz? Should I use the FPU? SSE commands? Or maybe it is better (and possible) to use part of the graphics card's GPU (GeForce Ti 4200)? I read some months ago that some calculations are done faster by a graphics chip. No, I am not talking about 3D graphics rendering. ;-)
So, what would be the first steps? I would like to start learning by doing with code that does the job described above, so I could try to manipulate the code here and there to see if it is faster or slower than before. For that I would also need a timer that shows me the time the code needed to do the job. The only problem for me is writing the first (more or less) good code to do this.

Best regards,
Marc
Posted on 2004-06-16 14:30:54 by marc waesche
Hello donkey,
Then I shall provide another example. Consider C89's aliasing rules versus FORTRAN's.
Posted on 2004-06-16 14:34:10 by death
Hi Death,

Again aliasing is something that is internal to the compiler, there is no aliasing in the final machine code. This is another example of how well the compiler interprets the source code and converts it into machine readable code. You are confusing the compiler implementation with the language syntax; they are two absolutely different things, otherwise every compiler would produce identical code for the same source. For example, I can write a line of DonkeyC (no, it doesn't exist) and have my compiler generate anything I want, as long as the end result is that the code does what the programmer expects. A good C compiler will take a switch/case and turn it into a mix of jump table and comparative jump chain, based on how many holes there are in the table and how adversely it will affect the cache. However, some will only implement it as a comparative jump chain, or always as a jump table. This is where the quality of the compiler and how well it exploits the CPU comes in; syntax is only for the compiler to interpret, it has nothing to do with the machine.
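A hedged example of the switch/case point, in C (illustration only; what a particular compiler emits will differ):

/* Dense, contiguous case values: a good compiler can index a table of
   jump targets directly. Make the values sparse (say 0, 100, 5000)
   and it will more likely emit a chain of compares, or a mix of both. */
int classify(int c)
{
    switch (c) {
    case 0:  return 10;
    case 1:  return 20;
    case 2:  return 30;
    case 3:  return 40;
    case 4:  return 50;
    default: return -1;
    }
}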
Posted on 2004-06-16 14:44:31 by donkey

For example: I want to write code that contains a loop that counts from 1 to 10000000, and every number should be written to a text file followed by a newline. ... So, what would be the first steps?

First rule for asm: only use asm if it really matters (when speed or code size is needed), otherwise use C or C++ or whatever you like. The main problem with asm is that it's not really "readable" if you did not write the code yourself; even with comments, asm is always damn hard to understand. Since the new C/C++ compilers do a very good job, you only need pure hand-coded ASM in some high-iteration inner loops. For normal stuff it doesn't matter whether you wait 0.00001 sec or 0.001 sec until a loop or function call finishes.

If you want to start with asm, I'd suggest starting with MMX2 or SSE1/2 instructions, since that is what can really speed things up, and it's also the one thing a compiler can't really do and will always have problems with.

That's why I nearly always use inline asm: it is a very nice addition to the C++ language, and I can fully optimize what actually needs to be optimized. It is nonsense to try to optimize everything in asm if 99% of the CPU time is spent in one loop... so just rewrite that loop and leave the rest in the higher-level language.
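For readers who haven't used it, a minimal sketch of what that looks like (MSVC's 32-bit __asm syntax; GCC uses a different inline syntax, and this is an illustration, not a tuned routine):

/* Only the inner loop is hand-written; everything around it stays C/C++. */
unsigned sum_bytes(const unsigned char *p, unsigned len)
{
    unsigned total = 0;
    __asm {
        mov   edx, p              // pointer to the data
        mov   ecx, len            // byte count
        xor   eax, eax            // running total
        test  ecx, ecx
        jz    done                // nothing to do for len == 0
    next_byte:
        movzx ebx, byte ptr [edx] // load one byte, zero-extended
        add   eax, ebx            // accumulate
        inc   edx
        dec   ecx
        jnz   next_byte
    done:
        mov   total, eax
    }
    return total;
}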

BTW, with the upcoming 64-bit versions it will become more and more important to use the compiler intrinsics rather than asm or inline asm. For example, MS Dev Studio 2005 will not support inline asm in 64-bit mode. If you write an asm routine for 32-bit, you have to rewrite it for 64-bit (to use the additional registers), but if you use the correct compiler intrinsics, the compiler chooses the registers and the code only needs to be recompiled to run on both 32-bit and 64-bit.
Don't get me wrong - I tried the latest intrinsics from Visual Studio 2005, but my hand-coded inline asm was still 200% faster. I still miss some settings for the intrinsics to be really comparable with inline asm or pure asm files.
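For comparison, the intrinsics style looks roughly like this (a sketch using the SSE2 intrinsics from <emmintrin.h>; the compiler assigns the registers, so the same source can be recompiled for 32-bit or 64-bit):

#include <emmintrin.h>   /* SSE2 intrinsics */

/* Add two int arrays four elements at a time. Sketch only: assumes n is
   a multiple of 4 and both pointers are 16-byte aligned; a real routine
   would handle the tail and unaligned data. */
void add_ints_sse2(int *dst, const int *a, const int *b, int n)
{
    int i;
    __m128i va, vb;
    for (i = 0; i < n; i += 4) {
        va = _mm_load_si128((const __m128i *)(a + i));
        vb = _mm_load_si128((const __m128i *)(b + i));
        _mm_store_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
}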

It also seems that the GCC compiler with AT&T inline syntax has an advantage here over the upcoming VS 2005.

PS: I'm really pushing for the Windows 64-bit edition, because damn, what I could do with 8 more general registers and 8 more SSE2 registers :) I could rewrite many of my routines and they would perform much faster. The irony is that I don't care about the 64-bit marketing crap - I don't need 64 bits, I just want all those extra registers :)
Posted on 2004-06-16 14:50:26 by Andy2222

Or maybe it is better (and possible) to use part of the graphics card's GPU (GeForce Ti 4200)? I read some months ago that some calculations are done faster by a graphics chip.

Yes, the GPU can do certain calculations faster than a general-purpose microprocessor. When you pit an ASIC against a general-purpose CPU, the ASIC usually wins no matter what disadvantages it has, because the ASIC was specifically designed to perform that one task, whilst the general-purpose CPU is designed to handle a multitude of tasks and is not dedicated to any one of them.

I remember reading once that some guys on comp.arch.fpga made an ASIC running at ~100 MHz and it was outperforming a dual Xeon system at the same task.
Posted on 2004-06-16 14:54:21 by x86asm
Hello donkey,

Again aliasing is something that is internal to the compiler, there is no aliasing in the final machine code. This is another example of how well the compiler interprets the source code and converts it into machine readable code.

Are you sure we're referring to the same aliasing problem? I recommend you do a bit of research before your next reply.


You are confusing the compiler implementation with the language syntax; they are two absolutely different things, otherwise every compiler would produce identical code for the same source.

You don't seem to differentiate between a language and a language's syntax. They are different.


For example, I can write a line of DonkeyC (no, it doesn't exist) and have my compiler generate anything I want, as long as the end result is that the code does what the programmer expects.

That won't be a C compiler. A C compiler should conform to the rules defined by the C standard, not to 'what the programmer expects'.
Posted on 2004-06-16 15:00:18 by death
Scali, how does that work with the heuristics? Is there some kind of tutorial available for that?


You could go to the library and pick up a book about compiler design, I suppose, if you want to get deep into the subject.
In short, heuristics are just rules that 'usually' apply. For example, if the compiler encounters y = x*11, it can replace the multiply with shifts and adds (y = (x << 3) + (x << 1) + x) and get faster code.
Why this example? Because it is a doubtful one. A single shift-and-add combination may be faster, but if you have a number of them, you have to execute many dependent instructions, while multiplies would pipeline and effectively be faster. That is the problem with heuristics: they don't always generate the best code.
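In C, the two forms look like this (illustration only - as said above, the 'clever' version is not automatically the faster one on a modern CPU):

/* The same computation, written both ways. The shift/add version is a
   chain of dependent instructions; the plain multiply pipelines well. */
int times_eleven_mul(int x)
{
    return x * 11;
}

int times_eleven_shift(int x)
{
    return (x << 3) + (x << 1) + x;    /* 8x + 2x + x = 11x */
}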
Posted on 2004-06-16 15:08:55 by Scali
For example: I want to write code that contains a loop that counts from 1 to 10000000, and every number should be written to a text file followed by a newline. What would be a good way to do this on a P4 2.53 GHz? Should I use the FPU? SSE commands?


That would be a perfect example where the quality of the code (I mean the generated code, assuming an optimal design) has little influence. The device the text file will be stored on (I suppose a HDD?) is orders of magnitude slower than the CPU, so the faster your code, the more time you simply spend waiting on the device.
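If you want to experiment with it anyway, a plain, untuned C version (standard library only - treat it as a baseline to measure against, not as optimized code) could look like this:

#include <stdio.h>
#include <time.h>

/* Write the numbers 1..10000000 to a text file, one per line, and
   report the elapsed time. Deliberately simple baseline sketch. */
int main(void)
{
    FILE *f = fopen("numbers.txt", "w");
    clock_t start, end;
    long i;

    if (f == NULL)
        return 1;

    start = clock();
    for (i = 1; i <= 10000000L; i++)
        fprintf(f, "%ld\n", i);
    end = clock();

    fclose(f);
    printf("%.2f seconds\n", (double)(end - start) / CLOCKS_PER_SEC);
    return 0;
}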
Posted on 2004-06-16 15:15:12 by Scali
Are you sure we're referring to the same aliasing problem? I recommend you do a bit of research before your next reply.


I assumed that you meant aliasing, which in FORTRAN (the context in which you used it) is defined as ...

From the Sun Microsystems FORTRAN Programmer's Guide, Chapter 7:

Aliasing occurs when the same storage address is referenced by more than one name. This happens when actual arguments to a subprogram overlap between themselves or between COMMON variables within the subprogram. For example, arguments X and Z refer to the same storage locations, as do B and H:


If you have a different definition of aliasing, I am sorry I misunderstood; I used the actual definition. There are no labels in machine code; they are just memory addresses, nothing more. And just to show that it is a compiler issue, there is the following note in the manual regarding aliasing:

The results on some systems and with higher optimization levels could be unpredictable.
Posted on 2004-06-16 15:42:00 by donkey