Hi all!

Well... I know about bitRAKE's great work, but... here's my alternative SWITCH/CASE construction. Since it can simply be built on top of .IF/.ELSEIF/.ELSE/.ENDIF, it just makes use of the benefits provided by those macroinstructions.

IMHO (at this moment) there's one disadvantage to the proposed set - some extra "garbage" code (an extra JMP instruction) following each .CASE terminated with "ret" (see the attached WndProc example). It's a specific case of use, though.

The whole set should be about 90% "idiot-proof" - so I hope... Feel free to use it, and comment on any possible improvements if you find any...

Regards, Mikael

================
BTW, don't worry about the "HLX" prefix... The provided macros are just part of a library under development (well - foreseen, at least), so the prefix is a simple acronym for "High(er) Level Xtension" :)
Posted on 2003-04-18 15:15:59 by MikaelC
erm.. look, I think asm should evolve, and I'm all for extending the HLL directives etc. I'm just not so sure I see any improvement in this instance beyond what the .IF/.ELSEIF/.ELSE/.ENDIF directives already offer.
When I looked at your example source, the first thing that sprang to mind was how similar it looks to C (not that there's anything wrong with that lol), and the second was how unreadable I found it compared to nested .IF based logic. Really it amounts to the same thing - am I showing a prejudice? Probably, but all the same, when I was a kid learning to code we only had two langs, BASIC and 8 bit machine code... I guess I prefer the more human approach of BASIC, being a language based on grammar and syntax, and I like the clarity afforded by properly tabbed, nested .IF logic under masm. Shoot me :)
Your work is excellent but your efforts are probably in vain :tongue:
Posted on 2003-04-21 07:29:34 by Homer
Suppose you produce some code like this:


mov eax, dwValue
cmp eax, 1
je caseOne
cmp eax, 2
je caseTwo
cmp eax, 3
je caseThree

; fill in more

jmp caseError

caseOne:
call func1
jmp caseDone

caseTwo:
call func2
jmp caseDone

caseThree:
call func3
jmp caseDone

caseError:
; value is less than 1 or greater than 3

caseDone:
; go on here

The code will then be faster and shorter than using .if/.elseif/.else/.endif.
Personally I would prefer such code, as well-maintained labels leave no need for comments. Have you ever seen Privalov's code? No comments, but labels that are well maintained. I tend to agree with Evilhome2k's opinion that you are reinventing the wheel. Nevertheless, good job :alright:
Posted on 2003-04-21 09:13:10 by roticv
Well... Thanks for the replies... Thinking about it a little harder, you are both right. What pushed me to construct these macros? Hmmm... It's a kind of laziness :) Sometimes even we asm coders get tired of writing the same code for the Nth time. I was tired of some of the common-type code I had to write lately, so I decided to find a shortcut. There were two ways: a more and a less complicated one. This is the less complicated construction - so I think. The first attempt was to build the whole set on my own (without using .IF/.ELSEIF and so on), but then... I realized that these "native" macroinstructions already do most of my job for me (especially: 1/ they "optimize" comparisons with zero values automatically; 2/ they "optimize" jumps to the near range automatically).

First of all - why macros? Simple: I was tired of long chains like:



cmp arg,val1
je Label1
cmp arg,val2
je Label2
cmp arg,val3
je Label3
...


The second thing... Why SWITCH? Because... it's a little bit shorter than .IF/.ELSEIF :) :) :) I do not have to repeat the name of the compared argument and the comparison operator (.IF arg ==, .ELSEIF arg ==, and so on) all the time. On the other hand, this macro should provide an easy way of checking an argument against a required set of values without constructing table loops or simply hard-coding everything. Suppose I'm writing a procedure and I want to check a passed argument against a given set of values. Suppose this argument must be equal to one of: 200, 240, 400, 480, 600, 768 (do you recognize these values?). IMHO it is a bit more handy to code it like this:



ProcedureA PROC mWidth:DWORD, mHeight:DWORD

.SWITCH(mHeight)
.CASE(200)
.CASE(240)
.CASE(400)
.CASE(480)
.CASE(600)
.CASE(768)
.DEFAULT
.RETURN(ErrorCode)
.ENDSWITCH
... further code


Don't you think so? There's one thing I'd like to emphasize: these macros were constructed with all the API and COM handling routines in mind, and I prefer using them in non-critical sections of code.
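
For comparison, the same check written directly with .IF would look something like this (just a rough equivalent, not necessarily the exact code the macros expand to):


.IF (mHeight == 200) || (mHeight == 240) || (mHeight == 400) || (mHeight == 480) || (mHeight == 600) || (mHeight == 768)
; value accepted - fall through to the further code
.ELSE
mov eax,ErrorCode
ret
.ENDIF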

Conclusion: this time it's simplicity - not speed and size :) For just the same reason I use a .ONZ(Label)/.ONNZ(Label) macro instead of "or eax,eax / jz Label" or "or eax,eax / jnz Label", and a .RETURN(Value) macro instead of "mov eax,Value / ret".
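
The idea behind these helpers is trivial; stripped of the leading dot and the parentheses, they boil down to something like this (a sketch only):


ONZ MACRO target:REQ        ;; jump to target if eax is zero
or eax,eax
jz target
ENDM

ONNZ MACRO target:REQ       ;; jump to target if eax is non-zero
or eax,eax
jnz target
ENDM

RETURNVAL MACRO retval:REQ  ;; set the return value and leave the proc
mov eax,retval
ret
ENDM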

I cannot deny that changing the chosen solution may just be a question of time. And that's why I'd like to thank you for the comments, suggestions and all your kind words :)

Regards, Mikael
Posted on 2003-04-21 13:39:03 by MikaelC
MikaelC,

I downloaded the code you posted a few days ago but I have not had the time to write a test piece to see how it works. I already have Greg Falen's very good switch macro in MASM32 and I have seen, but not tested, bitRAKE's version, so I am in fact interested in what you are doing.

What I would suggest is making a small working demo EXE program that shows the switch design you have used as it makes it a lot easier to evaluate and more people will see how your design works.

Regards,

hutch@movsd.com
Posted on 2003-04-21 21:04:30 by hutch--
hutch--, good idea, I'll provide an example executable ASAP...

Regards, Mikael
Posted on 2003-04-24 16:30:11 by MikaelC
Imho, switch is preferable to if/elseif - that's personal preference.

As for the implementation, here are my thoughts. Just thoughts - lots of people will probably bitch.

For unevenly distributed input with not too many input values, the cmp/je chain is probably best, with the most frequent hits near the top.

With a flat distribution, I'd say the two best approaches are either a jump table, or a binary search approach like what C compilers usually generate, and which bitRAKE cleverly implemented. The jump table will probably perform best, unless you have a _large_ set of possible input values, which would require a large table that might not fit into the cache.

Binary search has bad branch prediction, but has the advantage that the set of remaining items to search is cut in half at each branch. If you use a compiler with profiling, some of them can build unbalanced trees sorted by hit frequency, which should give even better performance than cmp/je if I am not mistaken.

Back to the jump table! The whole cache issue mostly matters if you plan on using this for very localized code snippets (i.e. where you could potentially have the jump table in cache - that shouldn't be possible at all for stuff like window message dispatching, no matter the jump table size). A byte-range table would probably be a good performer - 1024 bytes, easy to keep in memory.
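
Something like this, roughly (untested sketch - table and label names are made up):


.data
; 1024 bytes: one dword per possible byte value, all defaulting to the fallback handler
ByteJmpTbl dd 256 dup(offset DefaultCase)

.code
; at init time, point the slots you actually handle at their handlers, e.g.
mov dword ptr ByteJmpTbl[65*4], offset CaseA   ; 65 = 'A'

; the dispatch itself:
movzx eax, cl                       ; cl = the byte to dispatch on
jmp dword ptr [ByteJmpTbl+eax*4]    ; one indirect jump, no compare chain

CaseA:
; handle 'A'
jmp DispatchDone
DefaultCase:
; anything else
DispatchDone: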

If you can't expect to hold the jump table in cache, you'll get the penalty of a full cache line load plus not being able to predict the jump address long ahead of time. Not that severe, but taking the jump table size for a large range of input values into account, this might not be acceptable.

Where speed matters, you should ponder the code characteristics, do timings, and choose what performs best.

Where speed doesn't matter (this especially includes windows message dispatching), choose whatever you feel comfortable with. I'd probably use a switch statement, and not care much about the implementation.

I would think there is some performance overlap between the methods, and some methods will be either poor or good depending on the input range etc. I'm not sure where (or if) binary-tree Jcc beats cmp/je, but it might.

As always, do timings of your own.
Posted on 2003-04-24 17:00:27 by f0dder
The forms of branching implemented as pseudo high level .IF syntax, or the switch/case branching that is the topic here, are fundamentally done for clear coding where complex logic is involved.

As a performance issue, any sequence of CMP/Jxx is simply too slow to be competitive when the number of branches can range between hundreds and thousands of options.

I have in the MASM32 package a very high speed branching technique that uses a standard WndProc as an example of how to do branching in a much faster way. Apply this technology to something like a 256 character branching range and the advantage is clear if it is a true performance matter.

The difference between .IF blocks, switch blocks and cmp/jmp despatch code is trivial, and the advantage of pseudo high level constructions remains the clarity of the code.

Regards,

hutch@movsd.com
Posted on 2003-04-24 19:49:39 by hutch--

Apply this technology to something like a 256 character branching range and the advantage is clear if it is a true performance matter.

Yep, I doubt anything will beat a jump table if you're dealing with the byte range. Amen to that! With a 16 bit range, tables start to get large (2^16 * 4 == 256k - I doubt that would fit nicely inside your cache), and often you wouldn't need all the input values and would thus waste a lot of space. But certainly, jump tables should be the best performers - "up to a certain size".

cmp/je, ordered by frequency, should have good branch predictability. I guess this would be the best solution if you have input values across a wide range, but with a lot of holes (think values like {1, 100, 300, 5000}, but across the DWORD range).

Iirc, jump tables can't be branch predicted at all, and thus incur some performance penalty. However, that should also mean they don't take up a whole lot of branch prediction buffer entries, which the linear cmp/je or binary-search methods would suffer from. So the place to avoid a jump table is the highly scattered scenario from above - you'd use a lot of space for the jump table, filled with holes, and you'd end up loading a lot of mostly unused cache lines, whereas a good portion of the branching code might fit in cache. If you have a _lot_ of values like this, that's where it makes sense to use the "divide and conquer" approach of a binary-tree based cmp/Jcc.

I can't really think of a situation where high performance is necessary and I'd be working outside the byte-size range, and thus couldn't use a jump table. Nor of a scenario where I'd have a big bunch of input values scattered across the dword range - but I'm certain those scenarios exist (no, a wndproc is not a good example - windows messaging is slow enough that it shouldn't matter at all which method you use).


The difference between .IF blocks, switch blocks and cmp/jmp despatch code is trivial, and the advantage of pseudo high level constructions remains the clarity of the code.

The differences between .IF blocks and CMP/JE are trivial - but .IF/.ELSEIF generates something equivalent to:


cmp eax, value
jne nextcheck
<code>
jmp skip


The CMP/JE approach would be


cmp eax, value1
je @@handlevalue1
cmp eax, value2
je @@handlevalue2
cmp eax, value3
je @@handlevalue3
(etc)

Somewhat fewer Jccs, with all the branching done in one place and the handling code moved further down (some people find this easier to read). Whether it will perform better depends on how your processor handles branching - but iirc, some of the "clever people" on this board have been saying that CMP/JE is optimal.

If you like the switch syntax, then it has an advantage over all the other methods - you can change the implementation without too much fuss. With this recent macro addition, I think there are switch macros for .IF/.ELSEIF, CMP/JE, and bitRAKE's divide and conquer approach - not too bad, huh? Easier than converting between .IF/.ELSEIF and CMP/JE :).
Posted on 2003-04-25 02:23:40 by f0dder
f0dder,

You can assume a couple of things with branching: unless a branch is repeatedly used it will not be predicted properly, so assuming that an array of addresses can be predicted is wrong anyway. Most probably every jump will be mispredicted, so you will get a penalty there.

The reason why you use a single unpredicted jump is to avoid a long line of other unpredicted jumps, which will be much slower again, apart from the extra compares which slow it even further. Now put this into the context of what you can store in a branch prediction buffer and you have no chance of storing hundreds of branch options there.

When you are not trying to do critical wide range branching, a switch block or .IF block has advantages in terms of coding complex logic. The despatch technique still has to terminate the branch and end up somewhere after the original branch, so you don't get anything for nothing.

You can save a few dead jumps that are never taken by manually coding a sequence of compares and jumps, but it's only a size saving and it suffers from all of the problems of multiple labels and targeted jumps that a pseudo high level construction does not have.

Regards,

hutch@movsd.com
Posted on 2003-04-25 03:36:08 by hutch--

You can assume a couple of things with branching, unless it is repeatedly used, it will not be predicted properly so assuming that an array of addresses can be predicted is wrong anyway. Most probably every jump will be mispredicted so you will get a penalty there.

Well... iirc, on the Pentium 4 the CPU will take a guess if the branch hasn't been predicted yet. Iirc, the default is to follow the branch. Of course this is expensive if the branch isn't taken. However, I also _think_ this would make cmp/je good for a moderate number of values with large holes (which make a jump table wasteful), if the distribution of input values is uneven - you would place the test for the most frequently occurring value first, so even if the branch outcome hasn't been predicted yet, the CPU will guess and take the branch.

Apparently it doesn't take too many branches before this could totally trash your branch prediction buffers. Or rather, you must not often fall through a lot of branches - so this is best used when the distribution of input values is (very) uneven.

I am not 100% sure about the branch prediction rules, though, so I would have to read up on the processor manuals to be sure. But I think Svin and other clever people have said the CMP/JE is good, so perhaps I'm not too far off :)


The reason why you use a single unpredicted jump is to avoid a long line of other unpredicted jumps which will be much slower again apart from the extra compares which slow it ever further.

Exactly! I doubt anything will beat a jump table if you have to handle the byterange, especially if the input value distribution is mostly even.


Now put this into the context of what you can store in a branch prediction buffer and you have no chance of storing hundreds of branch options there.

Can't remember the size of the branch prediction buffers, but they certainly _are_ size limited - which is why you will want to avoid going through too many branches. Again, this can be solved with a jump table with a single unpredictable :) branch, but again this is not always feasible. 256kb to handle the entire 16bit range seems like quite a mouthful, especially if it will be mostly empty.


When you are not trying to work in critical wide range branching a switch block or .IF block has advantages in terms of coding complex logic. The despatch technique still has to terminate the branch and end up somewhere after the original branch so you don't get anything for nothing.

Nope, you don't get anything for nothing. Whether you prefer .IF blocks or switch is a personal coding preference (I personally like separating the value tests from the handling code - it gives me a better overview of the values I'm handling). Using switch does have the advantage over .IF blocks that you can change the way the compares are done, which could be beneficial.


You can save a few dead jumps that are never taken by manually coding a sequence of compares and jumps, but it's only a size saving and it suffers from all of the problems of multiple labels and targeted jumps that a pseudo high level construction does not have.

Humm.. unconditional jumps (like the ones masm generates) shouldn't be a problem - I don't think a BPU has trouble with them - so yes, that's merely a size optimization. However, if my fading memory about the "default action if the branch outcome is not yet predicted" is right, the CMP/JE approach should be better than the CMP/JNE that masm's .IF blocks produce? I might be wrong about the "default action" though.

One thing that _might_ be bad about .IF blocks, though, is the interleaving of compares and code. You could possibly end up loading a lot more cache lines, I think. With one "large block of CMP/JE", a large part of the dispatch code goes in a cache line.

My theory: if the distribution of input values is mostly even, the CMP/JE approach would often take you down the entire chain. This would be bad especially for a large number of compares, as you would get a lot of mispredicts and constantly overflow the branch prediction buffers. Here it would of course be very good to use a jump table if you can - but if your input values are spread across the 32 bit range, that isn't possible, and if they're spread over the 16 bit range, you might have an unacceptably large jump table. That's where the binary search divide and conquer method comes in handy; you'd still get a lot of branch mispredicts, and possibly also keep overflowing the branch prediction buffers, but on average you go through a lot fewer compares - each failed branch cuts the remaining compares in half. I'm sure there's a fancy math name for this :)

Nice to see that it seems we can have a peaceful discussion about this.
Posted on 2003-04-25 04:20:25 by f0dder
Think of it this way: your branch prediction buffer will not hold large numbers of preferred addresses, and while you can often write the code so that the fall-through is the default action, not all code can be written that way, and where you take a jump it will be stored in the branch prediction buffer until next time, or until it is displaced by other stored jump locations.

Now when you use an array of addresses, you restrict the unpredictable jump to one location and while it is slow, it is only slow in one place and not a long chain of unpredicted jumps.

If you have to scan single characters from a Unicode set you have a different problem, in that the values may span the 64k range, so you may be bound to do this branching in the much slower cmp/jxx manner; but if you can contain the range to a subset of the 64k range, you can still end up with a much smaller table that is viable to use as an array.

Now in comparison, if you ever had to handle 64k of branching which is highly unlikely, the sequence of CMP/Jxx would be enormous and the performance would take a severe hit just running through it so it would probably be still viable to load an array and branch that way.

The point is that as the range gets larger, the array of addresses to jump to gets progressively faster, and while it is viable to handle a few hundred branches the slow way in something like a WndProc, as a WndProc is slow anyway, do the same with that range in a speed intensive algo and you lose an enormous amount of time when repeatedly running the code. My own testing of that design on a 256 character range showed a speed increase of better than 8 times, and it was not highly optimised code.

My earlier comment comparing despatcher style jumps to locations further down a WndProc with a pseudo high level setup like .IF, or a simulated SWITCH block that encloses the code within the block, was that the timing difference is trivial, but the former can be a genuine mess to code, especially if there is nesting at some stage further down the procedure, which the .IF syntax handles automatically.

Regards,

hutch@movsd.com
Posted on 2003-04-25 11:55:48 by hutch--

Now when you use an array of addresses, you restrict the unpredictable jump to one location and while it is slow, it is only slow in one place and not a long chain of unpredicted jumps.

Exactly... I think I stated the same (differently written though) in a previous post. Jumptables can be very nice.


Now in comparison, if you ever had to handle 64k of branching which is highly unlikely, the sequence of CMP/Jxx would be enormous and the performance would take a severe hit just running through it so it would probably be still viable to load an array and branch that way.

Yep. 64k different branches isn't something you see very often, and if you did, you would probably do something quite different from a switch/.IF block. But think about another scenario: 256 possible input values, but spread across the WORD (or even worse, DWORD) range. You can't easily construct a jump table for this (not for 32 bit anyway - and you'd waste a huge amount of memory in the 16 bit case). With CMP/JE you'd have at most 256 CMP+Jcc pairs, and fall through to the default case otherwise. In the case of even distribution, you could use the binary tree approach, and suffer at most log2(256) = 8 Jccs before you hit the correct value (correct me if I'm wrong). Sure, 8 badly predicted branches are bad... but that beats 248 badly predicted branches :)
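
For a tiny set like {1, 100, 300, 5000}, the hand-written tree would look something like this (sketch only, labels made up - bitRAKE's macro generates this kind of thing for you):


cmp eax, 300          ; root: split the set {1, 100, 300, 5000} in half
jb LowHalf            ; remaining candidates: {1, 100}
je Case300
cmp eax, 5000         ; eax > 300, so only {5000} is left
je Case5000
jmp DefaultCase
LowHalf:
cmp eax, 100
je Case100
cmp eax, 1
je Case1
jmp DefaultCase

Each comparison that falls through roughly halves the remaining candidates, which is where the log2(n) worst case comes from.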


My own testing on that design on a 256 character range showed a speed increase better than 8 times and it was not highly optimised code.

And again, I too would recommend a jumptable when you have a "not too large" and "not too spread" amount of input values to handle.



I personally prefer dispatch-style coding. Whether I do jump tables, if/else, or switch style stuff, I find "dispatching" a lot cleaner than the messy mingling of control cases and code, and a lot easier to maintain too. In C, I've used jump tables in the past and will use them again. For larger switch/case-style stuff, I usually branch off to functions instead of handling it within the switch/case logic. But that's a matter of personal preference, and not really anything to discuss; "whatever floats your boat" :).
Posted on 2003-04-27 13:36:37 by f0dder
Oh, Scali bashed my head a bit :-) - the following sounds reasonable to me.

The problem with a jump table is worse than mispredicting a branch - quote scali: "the problem is that the jump target is not known until the value is loaded from memory (not sure how x86 handles it, might be a good idea to load the address into a register as early as possible, then jmp reg at the end), and therefore the pipeline cannot start sucking in the new code yet, so you get bubbles..."

And, quote scali: "also, branch prediction and 'gambling' have been in x86 for ages, at least since the Pentium. Normal behaviour is to take a branch when the target address is lower than the current one (possible loop), and not take it otherwise. On the P4 you can use prefix bytes to force jmps to always be taken, or never." (Iirc the P4 manuals state that these hints may or may not be followed - but I guess it won't hurt to use them in speed critical stuff. They're nop-ish prefixes on other platforms).
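
If I remember the encoding right, those prefixes are the 2Eh/3Eh segment override bytes reused as static hints, so in masm you'd emit them by hand - something like this (untested sketch):


cmp eax, 100
db 3Eh                ; P4-only hint: predict this branch taken (other CPUs just ignore the prefix)
je CommonCase
cmp eax, 5000
db 2Eh                ; P4-only hint: predict this branch not taken
je RareCase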

quote scali: "also, a 'wasteful' jumptable does not matter too much, if the cache is doing its associativity well :). It won't cache the parts that are never referenced, and the only thing that's wasted is the actual physical memory. But it depends on the granularity of the cache of course."

And, again quote scali: "another thing I haven't seen mentioned yet is a translation table... if you have eg 16 bit options, but <= 256 possible options (possibly many-to-one mapping), you can use a 64k byte table, which translates your 16 bit option to an 8 bit index into the actual jumptable.".

I still find 64k (+1k for the 2nd level table) somewhat wasteful for this, but it's better than the full 256k. This still doesn't solve large ranges, though.
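
In asm, that two-level lookup would be something along these lines (rough sketch, table names made up - both tables get filled in at init time):


.data
XlatTbl    db 65536 dup(0)                 ; 64k: maps the 16 bit value to a handler index, 0 = default
HandlerTbl dd 256 dup(offset DefaultCase)  ; 1k: the actual jump table, indexed by that byte

.code
movzx eax, bx                              ; bx = the 16 bit input value
movzx eax, byte ptr [XlatTbl+eax]          ; first level: value -> small index
jmp dword ptr [HandlerTbl+eax*4]           ; second level: index -> handler

DefaultCase:
; unmapped values end up here
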
Posted on 2003-04-28 04:13:08 by f0dder
Posted on 2003-04-28 08:02:01 by Maverick