Hello everyone..

I get stuck with case-switch implementation in assembly.
Can anyone tell how exactly we can implement case switch in assembly. Actually I get problem when implementing this logic.

where a is less than 8.
i = a / 8 ;
switch( i )
case1: func() ;
break ;
case2: func2() ;
break ;


casen: funcn()
break ;

How to implement this logic in assembly..

Posted on 2003-01-10 03:51:05 by processingspeed
bitRAKE wrote macros for that.
Posted on 2003-01-10 04:40:48 by bazik
I have a copy of Greg Falen's switch/case macro and it tested up fine. If you need it let me know.


Posted on 2003-01-10 05:41:48 by hutch--
Code By Bitrake


; Optimized Switch/Case Macro
; by bitRAKE (aka Rickey Bowers Jr.)

SwitchThreshHold = 3 ;when is a bad idea to partition

.switch MACRO
SwitchNodes = 0
SwitchDefault TEXTEQU <>
SwitchExit TEXTEQU <>

.case MACRO xVal:REQ, xNode:REQ
@CatStr(<SwitchValue>, %SwitchNodes) = &xVal
@CatStr(<SwitchNode>, %SwitchNodes) TEXTEQU <&xNode>
SwitchNodes = SwitchNodes + 1

.default MACRO def:REQ
SwitchDefault TEXTEQU <&def>

.endswitch MACRO sexit
LOCAL TheEnd, ww, oo, temp1, temp2, temp3, temp4

SwitchExit TEXTEQU <&sexit>

;; This is a bubble sort on the values of the case statements
;; the labels follow their associated values.
ww = SwitchNodes
ww = ww - 1
oo = ww
oo = oo - 1
temp1 = @CatStr(<SwitchValue>, %oo)
temp2 = @CatStr(<SwitchValue>, %ww)
;; We need MASM to evaluate this at assemble-time
% IF &temp1 GT &temp2
;; Numberic values can be swaped easily
@CatStr(<SwitchValue>, %oo) = &temp2
@CatStr(<SwitchValue>, %ww) = &temp1
;; Strings are a little harder...
;; Get the variable names
temp3 TEXTEQU @CatStr(<SwitchNode>, %oo)
temp4 TEXTEQU @CatStr(<SwitchNode>, %ww)
;; Get the value of those varibles
;; MASM doesn't allow &@CatStr(...)!
temp3 TEXTEQU &temp3
temp4 TEXTEQU &temp4
;; Swap them
@CatStr(<SwitchNode>, %oo) TEXTEQU &temp4
@CatStr(<SwitchNode>, %ww) TEXTEQU &temp3

;; This starts the trasversal of the array as if it were a bbtree
.SwitchPartition 0, SwitchNodes - 1

;; Output the code for the case nodes that haven't been done already
ww = SwitchNodes
ww = ww - 1
;; Previously output nodes are cleared in the transveral
% IFNB <@CatStr(<SwitchNode>, %ww)>
@CatStr(<SwitchLabel>, %ww, <:>)
% @CatStr(<SwitchNode>, %ww)
IFNB <&SwitchExit>
% &SwitchExit
jmp SwitchExitLabel


IFNB <&SwitchDefault>
% &SwitchDefault

IFNB <&SwitchExit>
% &SwitchExit



;; Transverse the sorted array of variables that was created for each
;; case statement like a balanced binary tree...
.SwitchPartition MACRO _min:REQ, _max:REQ
LOCAL delta, mmin, HighHalf

delta = _max - _min
mmin = _min

IF delta LT SwitchThreshHold
;; Output a string of nodes comparisons and a node on the end
WHILE delta GT 0
% cmp eax, @CatStr(<SwitchValue>, %mmin)
je @CatStr(<SwitchLabel>, %mmin)
mmin = mmin + 1
delta = delta - 1
% cmp eax, @CatStr(<SwitchValue>, %mmin)
jne SwitchDefaultLabel
% @CatStr(<SwitchNode>, %mmin)
;; Clear this label variable so we don't output the code again
@CatStr(<SwitchNode>, %mmin) TEXTEQU <>
IFNB <&SwitchExit>
% &SwitchExit
jmp SwitchExitLabel
;; Output a branch test
delta = _min + (delta/2)
% cmp eax, @CatStr(<SwitchValue>, %delta)
jg HighHalf
je @CatStr(<SwitchLabel>, %delta)

;; Re-Enter this macro until we've tested all the nodes
;; note that we skip the node we just tested for.
.SwitchPartition _min, (delta-1) ;; Lower half of the range
.SwitchPartition (delta+1), _max ;; High half of the range
Posted on 2003-01-10 06:04:46 by roticv
or for assembly purity warriors:
	mov	eax, dwValue

cmp eax, 1
je caseOne
cmp eax, 2
je caseTwo
cmp eax, 3
je caseThree

; fill in more

jmp caseError

call func1
jmp caseDone

call func2
jmp caseDone

call func3
jmp caseDone

; value is less 1 or greater 3

; go on here
Posted on 2003-01-10 06:39:39 by beaster

For reasons linked to long pipelines and branch prediction, the code of beaster is truly optimal.
I wanted to say this because many, too many, when optimizing still think in old CPU terms, and think that e.g. a indirect jump table here would have been faster. Not at all.

Also, even switch() statements as compiled by HLL compilers, using "binary tree" methods, can be extremely poor on branch prediction.. and perform really bad in real world applications.

I think that optimal switch() compilation is one of the hardest challenges for a compiler writer.
Posted on 2003-01-10 08:08:03 by Maverick
Hutch I definitely ask you for the file. and it will nice of you if you put the file in member's area so that everyone can go through it.

Beaster do you think that there is lots of cmp inside the code.
Can anyone implement this with the help of Jump Table........

Posted on 2003-01-10 08:35:28 by processingspeed

Can anyone implement this with the help of Jump Table.....
I wrote a macro for this, too:

...and there are macros for the method beaster posted:

The Intel compiler seems to do really well on switch/case code generation. The problem is partly because there is not enough information at compile-time to order the compares optimally. The Intel compiler can collect information during run-time (profiling) to improve the output from the compiler.

REAL programmers don't use branches. :grin:
Posted on 2003-01-10 09:47:11 by bitRAKE

What's an indirect jump table ?
Posted on 2003-01-10 10:09:11 by JimmyClif
Has anyone evaluated my implementation of REPNE SCASD? An abreviated version is shown below. Notice that only 4 instructions are coded, and no repetive TEST-JE-TEST-JE-TEST-JE jack off branching is involved. The default is automatically taken. Short 'n sweet, tried and true, one non-conditional jump does it all. Comments? Ratch

; I have a MACRO set to easily generate the search and jump tables below synchronously and simultaneously.

WMNTAB1 DD 1 ;the highest probability of 'hit' items should be put first in the table
DD 2
DD 3
DD 4
DD 100
DD DEFAULT ;default processing

MOV EDI,OFFSET WMNTAB1 ;beginning of search table
MOV ECX,WMTENT1 ;number of entries
REPNE SCASD ;search table for number
JMP DWORD PTR [WMATAB1-WMNTAB1-4+EDI] ;processing address from table
; CALL DWORD PTR [WMATAB1-WMNTAB1-4+EDI] ;processing address from table, alternative instruction


Posted on 2003-01-10 11:31:42 by Ratch
Ratch, the CPU can't predict what is going to happen after that JMP and the data table isn't likely to be in the cache. Then there is the matter of REPNE SCASD being so slow.

The CPU correctly predicts the loading of the instruction cache with CMP/JE instructions and the instruction execute very quickly. Only one branch is taken which is as costly as the JMP in your above posted method. We might ask if there is an upper bound where the number of non-taken branches are more costly than the costs of the other method? To my knowledge no one has tested this.
Posted on 2003-01-10 12:08:08 by bitRAKE
What you say about the jump is true, but it is in doubt only one time, not a possible hundred times like it would be in a TEST-JE sequence using my example. What about SCASD being slow? I read somewhere that SCASD was fast, at least on a ATHLON, which I have. I also read that that maybe SCASW and SCASB were slower, but who knows. I have not had any good luck with getting consistent results on timings. Hard to believe that a built in instruction running a long sequence wouldn't be faster than a macro coding implementation. Can anyone shed some light on this? Ratch
Posted on 2003-01-10 12:26:49 by Ratch
> I have not had any good luck with getting consistent results on timings
First place to look: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

Hard to believe that a built in instruction running a long sequence wouldn't be faster than a macro coding implementation. Can anyone shed some light on this?

That's actually often the case. scasd is vector decode on Athlons, and has a latency of 4 clocks. Not good.
Posted on 2003-01-10 13:41:04 by Jan Wassenberg
Does the latency mean a one time delay to get started, or does it mean a time delay for every iteration of STOSD? Can anyone provide some timings? I skimmed through the reference, but I did not see anything definitive about my question. Ratch
Posted on 2003-01-10 14:45:03 by Ratch
Ratch, document 22007, iirc. It is well documented.
Then add for the data cache miss (~90+ cycles on my PC).
Then add for the instruction cache miss (~90+ cycles on my PC).
Posted on 2003-01-10 15:18:56 by bitRAKE


What's an indirect jump table ?
Sorry, the term I used was a bit confusing, if read as "Indirect JumpTable", when instead I meant it as "IndirectJump Table". In short, I meant simply a jump table. ;)

Modern CPU's can't predict where the jump will end, and this has a big penalty. Moreover, as bitRAKE promptly said, there are also good chances that the table isn't in the data cache.. unless your switch statement is inside a loop.

Moreover, my negative comment on binary tree switch implementations is because they will produce a lot of branch mispredictions, making them quite slow in real world situations.

Of course you can't have a linear test/jxx too long, or even that will be too slow.

So at the end it's a "profile and choose the best option available" thing.

In any case, I like to assign a probability to each case statement, and the most straightforward solution is the linear test/jxx one, although of course it can't be too long. But in my experience it's rare anyway to find switch statements with a lot of cases, so usually the linear test/jxx is understimated, and in real conditions is the one that performs a bit better (and also the most readable).
Posted on 2003-01-10 15:30:38 by Maverick
Mavrick & bitRAKE,
After the REPNE SCASD instruction, the CPU should be able to predict where the jump will end. It is completely determined by the value of EDI. One can insert many instructions after REPNE SCASD, and still do the jump afterwards, as long as EDI is not changed. This should give the CPU time to look ahead to the end of the jump. I cannot find those figures in the documentation you quoted bitRAKE. In fact, I find it confusing to read without a slow study of the principles involved. Therefore I am going to try to run a timing test tonight. I will let you know what I find out if I am successful. Ratch
Posted on 2003-01-10 16:40:41 by Ratch
Hi Ratch,
I'm afraid you've a too high opinion of the target CPU.
IIRC, indirect jumps are always predicted as "not taken".. even if probably it wouldn't cost too many transistors to predict them in the scenario you described. AFAIK no x86 compatible CPU does it, though.. but it should be checked, since I'm tired and sleepy right now and can't use my memory and neurons correctly. :)
Posted on 2003-01-10 17:27:30 by Maverick
I have attached the test piece I did with Greg Falen's Switch/Case macro system and I have found that it works fine, is reliable and can be nested within other switch statements with no problems at all.

If speed is a problem with recurring branching that Switch/Case logic is generally used for, I recently posted a technique that is far faster than any switch based system as it does not do the sequential comparisons that switch based systens do.

In non critical code, it just does not matter but if performance really matters in branching, the method I posted is worth looking at.


Posted on 2003-01-10 18:31:14 by hutch--
Cheers... Sometimes I'm just a little slow ;)
Posted on 2003-01-10 18:39:36 by JimmyClif