Hello everyone,
I'm stuck on a switch/case implementation in assembly. Can anyone tell me how exactly to implement switch/case in assembly? I'm having trouble with the following logic, where a is less than 8:
i = a / 8 ;
switch( i )
{
    case 1: func1() ;
            break ;
    case 2: func2() ;
            break ;
    ...
    case n: funcn() ;
            break ;
}
How do I implement this logic in assembly?
Regards
bitRAKE wrote macros for that.
I have a copy of Greg Falen's switch/case macro and it tested up fine. If you need it let me know.
Regards,
hutch@movsd.com
Code by bitRAKE
;#############################################
; Optimized Switch/Case Macro
; by bitRAKE (aka Rickey Bowers Jr.)
;#############################################
SwitchThreshHold = 3 ;when it is a bad idea to partition
.switch MACRO
SwitchNodes = 0
SwitchDefault TEXTEQU <>
SwitchExit TEXTEQU <>
ENDM
.case MACRO xVal:REQ, xNode:REQ
@CatStr(<SwitchValue>, %SwitchNodes) = &xVal
@CatStr(<SwitchNode>, %SwitchNodes) TEXTEQU <&xNode>
SwitchNodes = SwitchNodes + 1
ENDM
.default MACRO def:REQ
SwitchDefault TEXTEQU <&def>
ENDM
.endswitch MACRO sexit
LOCAL TheEnd, ww, oo, temp1, temp2, temp3, temp4
SwitchExit TEXTEQU <&sexit>
;; This is a bubble sort on the values of the case statements
;; the labels follow their associated values.
ww = SwitchNodes
WHILE ww
ww = ww - 1
oo = ww
WHILE oo
oo = oo - 1
temp1 = @CatStr(<SwitchValue>, %oo)
temp2 = @CatStr(<SwitchValue>, %ww)
;; We need MASM to evaluate this at assemble-time
% IF &temp1 GT &temp2
;; Numeric values can be swapped easily
@CatStr(<SwitchValue>, %oo) = &temp2
@CatStr(<SwitchValue>, %ww) = &temp1
;; Strings are a little harder...
;; Get the variable names
temp3 TEXTEQU @CatStr(<SwitchNode>, %oo)
temp4 TEXTEQU @CatStr(<SwitchNode>, %ww)
;; Get the value of those variables
;; MASM doesn't allow &@CatStr(...)!
temp3 TEXTEQU &temp3
temp4 TEXTEQU &temp4
;; Swap them
@CatStr(<SwitchNode>, %oo) TEXTEQU &temp4
@CatStr(<SwitchNode>, %ww) TEXTEQU &temp3
ENDIF
ENDM
ENDM
;; This starts the traversal of the array as if it were a balanced binary tree
.SwitchPartition 0, SwitchNodes - 1
;; Output the code for the case nodes that haven't been done already
ww = SwitchNodes
WHILE ww
ww = ww - 1
;; Previously output nodes are cleared in the traversal
% IFNB <@CatStr(<SwitchNode>, %ww)>
@CatStr(<SwitchLabel>, %ww, <:>)
% @CatStr(<SwitchNode>, %ww)
IFNB <&SwitchExit>
% &SwitchExit
ELSE
jmp SwitchExitLabel
ENDIF
ENDIF
ENDM
SwitchDefaultLabel:
IFNB <&SwitchDefault>
% &SwitchDefault
IFNB <&SwitchExit>
% &SwitchExit
ENDIF
ENDIF
SwitchExitLabel:
ENDM
;; Traverse the sorted array of variables that was created for each
;; case statement like a balanced binary tree...
.SwitchPartition MACRO _min:REQ, _max:REQ
LOCAL delta, mmin, HighHalf
delta = _max - _min
mmin = _min
IF delta LT SwitchThreshHold
;; Output a string of nodes comparisons and a node on the end
WHILE delta GT 0
% cmp eax, @CatStr(<SwitchValue>, %mmin)
je @CatStr(<SwitchLabel>, %mmin)
mmin = mmin + 1
delta = delta - 1
ENDM
% cmp eax, @CatStr(<SwitchValue>, %mmin)
jne SwitchDefaultLabel
% @CatStr(<SwitchNode>, %mmin)
;; Clear this label variable so we don't output the code again
@CatStr(<SwitchNode>, %mmin) TEXTEQU <>
IFNB <&SwitchExit>
% &SwitchExit
ELSE
jmp SwitchExitLabel
ENDIF
ELSE
;; Output a branch test
delta = _min + (delta/2)
% cmp eax, @CatStr(<SwitchValue>, %delta)
jg HighHalf
je @CatStr(<SwitchLabel>, %delta)
;; Re-Enter this macro until we've tested all the nodes
;; note that we skip the node we just tested for.
.SwitchPartition _min, (delta-1) ;; Lower half of the range
HighHalf:
.SwitchPartition (delta+1), _max ;; High half of the range
ENDIF
ENDM
or for assembly purity warriors:
mov eax, dwValue
cmp eax, 1
je caseOne
cmp eax, 2
je caseTwo
cmp eax, 3
je caseThree
; fill in more
jmp caseError
caseOne:
call func1
jmp caseDone
caseTwo:
call func2
jmp caseDone
caseThree:
call func3
jmp caseDone
caseError:
; value is less than 1 or greater than 3
caseDone:
; go on here
For reasons linked to long pipelines and branch prediction, beaster's code is truly optimal.
I wanted to say this because many, too many, people still think in old-CPU terms when optimizing, and assume that e.g. an indirect jump table would have been faster here. Not at all.
Also, even switch() statements as compiled by HLL compilers, using "binary tree" methods, can be extremely poor on branch prediction and perform really badly in real-world applications.
I think that optimal switch() compilation is one of the hardest challenges for a compiler writer.
Hutch, I'd definitely like to ask you for the file, and it would be nice of you to put it in the members' area so that everyone can go through it.
Beaster, don't you think there are a lot of cmp instructions in that code?
Can anyone implement this with the help of a jump table?
Regards
> Can anyone implement this with the help of Jump Table.....
http://www.asmcommunity.net/board/index.php?topic=2265&highlight=Jump+Table+macro
...and there are macros for the method beaster posted:
http://www.asmcommunity.net/board/index.php?topic=6073&highlight=CASE
The Intel compiler seems to do really well on switch/case code generation. The problem is partly because there is not enough information at compile-time to order the compares optimally. The Intel compiler can collect information during run-time (profiling) to improve the output from the compiler.
REAL programmers don't use branches. :grin:
Mav,
What's an indirect jump table ?
Has anyone evaluated my implementation of REPNE SCASD? An abbreviated version is shown below. Notice that only 4 instructions are coded, and no repetitive TEST-JE-TEST-JE-TEST-JE branching is involved. The default is automatically taken. Short 'n sweet, tried and true, one non-conditional jump does it all. Comments? Ratch
.DATA
; I have a MACRO set to easily generate the search and jump tables below synchronously and simultaneously.
WMNTAB1 DD 1 ;the highest probability of 'hit' items should be put first in the table
DD 2
DD 3
DD 4
.......
DD 100
WMATAB1 DD CASE1
DD CASE2
DD CASE3
DD CASE4
.......
DD CASE100
DD DEFAULT ;default processing
WMTENT1 EQU 100
.CODE
MOV EDI,OFFSET WMNTAB1 ;beginning of search table
MOV ECX,WMTENT1 ;number of entries
REPNE SCASD ;search table for number
JMP DWORD PTR [WMATAB1-WMNTAB1-4+EDI] ;processing address from table
; CALL DWORD PTR [WMATAB1-WMNTAB1-4+EDI] ;processing address from table, alternative instruction
.......
CASE1:
........
CASE2:
.........
CASE3:
.........
CASE4:
.........
.........
CASE100:
..........
DEFAULT:
..........
Ratch, the CPU can't predict what is going to happen after that JMP and the data table isn't likely to be in the cache. Then there is the matter of REPNE SCASD being so slow.
The CPU correctly predicts the loading of the instruction cache with CMP/JE instructions, and the instructions execute very quickly. Only one branch is taken, which is as costly as the JMP in your method posted above. We might ask if there is an upper bound where the number of non-taken branches becomes more costly than the other method? To my knowledge no one has tested this.
bitRAKE,
What you say about the jump is true, but it is in doubt only one time, not a possible hundred times like it would be in a TEST-JE sequence using my example. What about SCASD being slow? I read somewhere that SCASD was fast, at least on an ATHLON, which I have. I also read that maybe SCASW and SCASB were slower, but who knows. I have not had any good luck getting consistent results on timings. Hard to believe that a built-in instruction running a long sequence wouldn't be faster than a macro-coded implementation. Can anyone shed some light on this? Ratch
> I have not had any good luck with getting consistent results on timings
First place to look: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
> Hard to believe that a built in instruction running a long sequence wouldn't be faster than a macro coding implementation. Can anyone shed some light on this?
That's actually often the case. scasd is vector decode on Athlons, and has a latency of 4 clocks. Not good.
Jan,
Does the latency mean a one-time delay to get started, or does it mean a delay for every iteration of SCASD? Can anyone provide some timings? I skimmed through the reference, but I did not see anything definitive about my question. Ratch
Ratch, document 22007, iirc. It is well documented.
Then add for the data cache miss (~90+ cycles on my PC).
Then add for the instruction cache miss (~90+ cycles on my PC).
Mav,
What's an indirect jump table ?
Modern CPUs can't predict where the jump will end, and this carries a big penalty. Moreover, as bitRAKE promptly said, there is also a good chance that the table isn't in the data cache, unless your switch statement is inside a loop.
Moreover, my negative comment on binary-tree switch implementations is because they will produce a lot of branch mispredictions, making them quite slow in real-world situations.
Of course you can't have a linear test/jxx chain that is too long either, or even that will be too slow.
So in the end it's a "profile and choose the best option available" thing.
In any case, I like to assign a probability to each case statement, and the most straightforward solution is the linear test/jxx one, although of course it can't be too long. But in my experience it's rare anyway to find switch statements with a lot of cases, so the linear test/jxx is usually underestimated, and in real conditions it is the one that performs a bit better (and is also the most readable).
Mavrick & bitRAKE,
After the REPNE SCASD instruction, the CPU should be able to predict where the jump will end. It is completely determined by the value of EDI. One can insert many instructions after REPNE SCASD, and still do the jump afterwards, as long as EDI is not changed. This should give the CPU time to look ahead to the end of the jump. I cannot find those figures in the documentation you quoted bitRAKE. In fact, I find it confusing to read without a slow study of the principles involved. Therefore I am going to try to run a timing test tonight. I will let you know what I find out if I am successful. Ratch
Hi Ratch,
I'm afraid you have too high an opinion of the target CPU.
IIRC, indirect jumps are always predicted as "not taken".. even if it probably wouldn't cost too many transistors to predict them in the scenario you described. AFAIK no x86-compatible CPU does it, though.. but it should be checked, since I'm tired and sleepy right now and can't use my memory and neurons correctly. :)
I have attached the test piece I did with Greg Falen's Switch/Case macro system, and I have found that it works fine, is reliable, and can be nested within other switch statements with no problems at all.
If speed is a problem with the recurring branching that Switch/Case logic is generally used for, I recently posted a technique that is far faster than any switch-based system, as it does not do the sequential comparisons that switch-based systems do.
In non-critical code it just does not matter, but if performance really matters in branching, the method I posted is worth looking at.
Regards,
hutch@movsd.com
Cheers... Sometimes I'm just a little slow ;)