Hi Everyone,
I finished and tried my timing program, which I am sharing in the attachment to this message. It times a search and jump over up to 100 items using my SCASD scheme, followed by the same test using the classic CMP-JE method. Look over the code and see whether I was unfair to either method. Running on my box, I found that the SCASD method uses half as many RDTSC ticks as the CMP-JE method. The program has commented-out jumps that can be activated to reverse the order of testing, or to test only one method. I look forward to hearing from all of you on what you get for results. Ratch
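
For anyone who does not want to open the attachment, the two methods boil down to something like the following. This is only a bare sketch - the real program times tables of up to 100 entries, and the values, table sizes, and Case/Default labels here are purely illustrative.

.data
KeyTab  dd 1, 2, 3, 4                    ; keys to match, most probable first
KEYS    equ ($ - KeyTab) / 4             ; number of table entries
AdrTab  dd Case1, Case2, Case3, Case4    ; handlers, same order as the keys

.code
    ; SCASD scheme: scan the key table, then dispatch through the address table
    mov   eax, 3                         ; value to dispatch on
    mov   ecx, KEYS
    mov   edi, OFFSET KeyTab
    repne scasd                          ; EDI stops one DWORD past the match
    jne   Default                        ; ZF clear -> no key matched
    call  DWORD PTR [AdrTab - KeyTab - 4 + edi]

    ; classic CMP-JE method: one compare and branch per case
    mov   eax, 3
    cmp   eax, 1
    je    Case1
    cmp   eax, 2
    je    Case2
    cmp   eax, 3
    je    Case3
    cmp   eax, 4
    je    Case4
    jmp   Default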

Hutch,
I will look over the code you posted. Ratch
Posted on 2003-01-10 19:49:43 by Ratch
Ratch, I think it would be valuable to see the time response curve for all input values. That way we can see where the crossover point is for different processors and cache effects.

My results:

JumpTable: 1000 loops, 258533 cycles
CompareList: 1000 loops, 84819 cycles

...and this is with the table in the cache! :confused:

This might be biased as I threw it together really quickly.
Posted on 2003-01-10 21:12:00 by bitRAKE
Hutch,
I looked through the code you sent. It does indeed implement high-level SWITCH code, but sometimes SWITCH generates unnecessary dead code. I extracted the following snippet from the code you posted. Notice that the Case WM_CLOSE macro emits a superfluous JMP instruction where none is needed, because the return 0 already suffices to exit the WM_PAINT processing. It also appears that SCASD is faster in timing tests, at least on my box; see my other post about this. Ratch



Case WM_PAINT
invoke Paint_Proc,hWin
return 0

Case WM_CLOSE
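
The expansion comes out roughly like the following (illustrative only, not the exact macro output). The return 0 already leaves the procedure, so the JMP the macro emits to skip past the remaining cases can never execute:

    cmp    uMsg, WM_PAINT
    jne    @F
    invoke Paint_Proc, hWin
    mov    eax, 0              ; return 0 ...
    ret                        ; ... leaves the WndProc right here
    jmp    SwitchEnd           ; superfluous: this jump is never reached
  @@:
    cmp    uMsg, WM_CLOSE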
Posted on 2003-01-10 21:45:59 by Ratch
Ratch, I've used this code to time individual instructions with very high accuracy. It is attached below - sorry for the mess; as I said, I just threw it together. Ignore all the other routines in there (the labeling is somewhat self-documenting ;) ). You'll need VKDEBUG to view the program output. Maybe you can find errors in my test? I looked at yours and you don't save EDI - which you'd have to in a WndProc. Also, my tests are cache aligned to reduce memory effects.

Output is now very steady at:

100 values
JumpTable: 1000 loops, 240060 cycles
CompareList: 1000 loops, 74012 cycles

I've included all to build and test yourself.
If you have RadASM just double-click the .RAP file :)

Values are hard to nail down when the data isn't cached. I feel these numbers are mostly memory timings, but I'll post them anyway. :) Each of these is a single uncached (code & data) execution:

100 values
JumpTable: 2666 cycles
CompareList: 2644 cycles

1000 values
JumpTable: 17084 cycles
CompareList: 16710 cycles

10000 values
JumpTable: 21931 cycles
CompareList: 19254 cycles

Different story now...

JumpTable: 100 loops, 13176 cycles
CompareList: 100 loops, 12001 cycles

...here each value is tested starting at 100 going to 1.
All data & code is in the cache. Very interesting.

...and here with 10 to 1:

JumpTable: 10 loops, 332 cycles
CompareList: 10 loops, 565 cycles

Wow!?
Posted on 2003-01-10 22:08:05 by bitRAKE
bitRake,
I am having a hard time understanding your code. There are too many MACROs to sift through. I cannot find the structure nm defined anywhere. Another problem is the syntax nm&.ldword; I think it belongs with a call to a PROC, which I don't use and am not familiar with. Here is one code snippet I don't understand.


stopTime MACRO nm:REQ
pushad
xor eax,eax ;WHY ZERO OUT EAX?
CPUID
RDTSC
IFDIF <&nm>,<OverHead> ;WHAT IS THIS DOING?
;; delete the overhead for startTime/stopTime
sub eax,OverHead.ldword
sbb edx,OverHead.hdword
ENDIF
add nm.ldword,eax ;SHOULDN'T THIS BE A MOV nm.ldword,eax ?
adc nm.hdword,edx
popad
ENDM


It would be nice if you could simplify things a bit. I will look at it some more tomorrow.

To answer your question about EDI: yes, I do save it, along with EBX, EBP, and ESI, in my WndProc. That is all done before and outside of the code in question. Ratch
Posted on 2003-01-11 00:07:44 by Ratch
Ratch, the structure is at the top of the include file; nm is the macro parameter (a string that names the structure being used). 'xor eax,eax' seemed to fix a timing error on P3's? OverHead is the name of the 'do nothing' structure used to find the overhead of the timing code - that overhead is subtracted from each execution of each timed proc. 'IFDIF <&nm>,<OverHead>': if the overhead were subtracted from itself it would come out zero - not a good thing. It is an ADD because the macro accumulates timing over multiple runs. Sorry, I doubt this will ever be simplified for public consumption - it does the job for me. You might like Maverick's profiling code?
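
Roughly, the two pieces you are asking about look like this (a simplified sketch, not the actual contents of the include file). The idea is that startTime subtracts the starting count, so the ADD/ADC in stopTime leaves the accumulated elapsed cycles in the structure:

TIMEVAL STRUCT
  ldword dd 0                  ; low  32 bits of accumulated cycles
  hdword dd 0                  ; high 32 bits of accumulated cycles
TIMEVAL ENDS

.data
OverHead TIMEVAL <>            ; 'do nothing' pair, holds the timing-code overhead
MyTimer  TIMEVAL <>            ; accumulator for the routine under test

startTime MACRO nm:REQ
    pushad
    xor eax,eax                ; fixed CPUID input leaf for consistent timing
    CPUID                      ; serialize before reading the counter
    RDTSC                      ; EDX:EAX = time-stamp counter
    sub nm.ldword,eax          ; store the start negatively, so that the
    sbb nm.hdword,edx          ; ADD/ADC in stopTime yields the elapsed count
    popad
ENDM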
Posted on 2003-01-11 01:12:29 by bitRAKE
Ratch,

You are right, it is the return macro that adds the extra code, not Greg's macro. I have no problem with an extra dead jump, though: it does not slow the code down any, it just makes it slightly larger, and where you have a high-level style of code, that is no great loss.

For branching speed where it matters, have a look at the source I posted in this thread.

http://www.asmcommunity.net/board/showthread.php?threadid=9258

There will be little difference when there are only a few branches, but as the count gets large, this method shows its speed advantage because it has the same branching overhead whether there is 1 case or as many as you like.
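
The gist of it (a bare sketch, not the source at the link - the character range and handler names are illustrative) is that the input value selects the handler directly, so the dispatch cost is one range check plus one indirect jump regardless of how many cases exist:

.data
Handlers dd CaseA, CaseB, CaseC      ; one handler address per case, indexed 0..2

.code
    movzx eax, byte ptr [esi]        ; next character from the test string
    sub   eax, 'a'                   ; map 'a'..'c' to table index 0..2
    cmp   eax, 2
    ja    Default                    ; anything out of range -> default handler
    jmp   DWORD PTR [Handlers + eax*4]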

Regards,

hutch@movsd.com
Posted on 2003-01-11 01:13:10 by hutch--

Pals.. whatever test you run on a SWITCH construct must feed in the test sequence in a realistic manner. Surely random input is more representative than a linear iteration. This is of key importance, especially in consideration of branch prediction mechanisms.
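
For example, something along these lines (just a sketch, with an arbitrary seed and range) feeds pseudo-random case values to whichever dispatch method is being timed, instead of stepping through them in order:

    mov  esi, 12345678h          ; arbitrary seed
    mov  ebx, 1000               ; number of dispatches to time
TestLoop:
    imul esi, esi, 69069         ; LCG step: x = 69069 * x + 1 (mod 2^32)
    inc  esi
    mov  eax, esi
    xor  edx, edx
    mov  ecx, 100
    div  ecx                     ; EDX = remainder, 0..99
    lea  eax, [edx+1]            ; case value in 1..100
    ; dispatch on EAX here with the method under test
    dec  ebx
    jnz  TestLoop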
Posted on 2003-01-11 03:03:30 by Maverick
Ratch: latency basically means "how long 'til the result is ready?". Also, the decoder can only handle 1 vectorpath instruction per clock (as opposed to 3 directpath ops).
Try replacing REPNE SCASD with a simple loop (possibly unrolled) - it will most likely be faster.
Then again, as Maverick points out, if you don't have tons of equally probable cases, a series of CMP/Jcc is the way to go.
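
Something like this (just a sketch, reusing the illustrative KeyTab/AdrTab layout from the sketch near the top of the thread) would be the plain loop I mean:

    mov  eax, 3                   ; value being dispatched on
    mov  edx, OFFSET KeyTab       ; key table
    mov  ecx, KEYS                ; number of table entries
NextKey:
    cmp  [edx], eax
    je   Found
    add  edx, 4
    dec  ecx
    jnz  NextKey
    jmp  Default                  ; no key matched
Found:
    call DWORD PTR [AdrTab - KeyTab + edx]   ; matching handler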

BTW: that optimization guide is well worth the read :)
Posted on 2003-01-11 05:47:34 by Jan Wassenberg
bitRAKE,
Your code is starting to make more sense now. How do you get ALIGN 64 to work with MASM? Your code appears to be MASM, yet I get an error when I try anything greater than ALIGN 16. Why are you aligning everything on 64 - is it a cache thing? I notice that you jump to only one address in the jump-table test, but you vary the jump address in the compare test; I don't know if that makes any difference. Also, you PUSH and POP EDI, which is unnecessary for a WndProc, because it has already been saved along with EBP, ESI, and EBX. I will need some more time to trace your code out. Right now I have some other things to do. Ratch

Jan,
I will do some timing tests on a loop vs SCASD when I get some time. Ratch

hutch,
I will look at the code when I get time. Ratch
Posted on 2003-01-11 13:23:32 by Ratch
Ratch, in my thinking EDI does not need to be saved in the CMP/JE code, but it does for REPNE SCASD - either way it changes the timing very little (the stack is in the cache). ALIGN 64 requires the segments to be PAGE aligned (look at the top of Time-It.Inc - I don't use the .MODEL directive). Yes, the cache line size is 64 bytes. The code is designed to test instruction execution speed, so it only gives a very narrow view of this code's speed, imho.
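
A trimmed-down sketch of that kind of setup (a guess at the shape of it, not Time-It.Inc itself): ALIGN cannot exceed the enclosing segment's own alignment, so the code segment is declared PAGE aligned (256 bytes) instead of the PARA (16-byte) default you get from .MODEL:

.686
_TEXT SEGMENT PAGE PUBLIC FLAT 'CODE'

    ALIGN 64                  ; start the timed routine on a 64-byte cache line
TimedCode:
    rdtsc                     ; placeholder body for whatever is being timed
    ret

_TEXT ENDS
END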
Posted on 2003-01-11 16:21:17 by bitRAKE
Ratch,
I can't assemble your example (it needs a file I don't have:
INCLUDE \MASM32\INCLUDE\MYSTUFF.INC).
So, please add my algo to your example and repost it.
.data

MyTabEntry dd 1, CASE1
dd 2,CASE2
dd 3,CASE3
dd 4,CASE4
dd 5,CASE5
dd 6,CASE6
dd 7,CASE7
dd 8,CASE8
dd 9,CASE9
dd 10,CASE10
dd 11,CASE11
dd 12,CASE12
dd 13,CASE13
dd 14,CASE14
dd 15,CASE15
dd 16,CASE16
dd 17,CASE17
dd 18,CASE18
dd 19,CASE19
dd 20,CASE20
dd 21,CASE21
dd 22,CASE22
dd 23,CASE23
dd 24,CASE24
dd 25,CASE25
dd 26,CASE26
dd 27,CASE27
dd 28,CASE28
dd 29,CASE29
dd 30,CASE30
dd 31,CASE31
dd 32,CASE32
dd 33,CASE33
dd 34,CASE34
dd 35,CASE35
dd 36,CASE36
dd 37,CASE37
dd 38,CASE38
dd 39,CASE39
dd 40,CASE40
dd 41,CASE41
dd 42,CASE42
dd 43,CASE43
dd 44,CASE44
dd 45,CASE45
dd 46,CASE46
dd 47,CASE47
dd 48,CASE48
dd 49,CASE49
dd 50,CASE50
dd 51,CASE51
dd 52,CASE52
dd 53,CASE53
dd 54,CASE54
dd 55,CASE55
dd 56,CASE56
dd 57,CASE57
dd 58,CASE58
dd 59,CASE59
dd 60,CASE60
dd 61,CASE61
dd 62,CASE62
dd 63,CASE63
dd 64,CASE64
dd 65,CASE65
dd 66,CASE66
dd 67,CASE67
dd 68,CASE68
dd 69,CASE69
dd 70,CASE70
dd 71,CASE71
dd 72,CASE72
dd 73,CASE73
dd 74,CASE74
dd 75,CASE75
dd 76,CASE76
dd 77,CASE77
dd 78,CASE78
dd 79,CASE79
dd 80,CASE80
dd 81,CASE81
dd 82,CASE82
dd 83,CASE83
dd 84,CASE84
dd 85,CASE85
dd 86,CASE86
dd 87,CASE87
dd 88,CASE88
dd 89,CASE89
dd 90,CASE90
dd 91,CASE91
dd 92,CASE92
dd 93,CASE93
dd 94,CASE94
dd 95,CASE95
dd 96,CASE96
dd 97,CASE97
dd 98,CASE98
dd 99,CASE99
dd 100,CASE100
dd 0,DEFAULT

.code
; MOV EAX,NUMBER
; MOV ECX,CTENT1
; MOV EDI,OFFSET CNTAB1
; REPNE SCASD
; CALL DWORD PTR [CATAB1-CNTAB1-4+EDI]

mov edx, NUMBER ; edx=NUMBER =101
mov eax, 100 ; eax->bottom limit=100
xor ecx, ecx ; ecx->top limit = 0
lowM:
mov ebx, eax ; ebx=>bottom
add eax, ecx ; ecx=>top
checkM:
shr eax, 1 ; divide by 2
cmp ecx, ebx ; done if top == bottom
je DEFAULT ; not found => jmp to Default
cmp edx, [MyTabEntry+8*eax] ; edx=101=NUMBER
jc lowM ;
lea ecx, [eax+1] ; ecx => top
lea eax, [ebx+eax+1] ; ebx=> bottom
jne checkM ; if not equal loop again
call dword ptr [MyTabEntry+ecx*8-8+4] ; call proc

Regards,
Lingo
Posted on 2003-01-11 23:58:57 by lingo12
lingo12,
Sorry, I did not include everything you needed to assemble the example I posted. Instead of debugging your code, I am correcting that oversight by posting all the macros you need to build my example. The code you submitted appears to be a binary search scheme. I don't know if that will be faster, but the search list needs to be sorted for that to work. The SCASD and TEST-JE methods can arrange the most probable numbers at the beginning of the list to insure a quicker find. And for a short list, the overhead of computing the next search area might overwhelm the savings of a directed search. Anyway, let the best algo win. Ratch
Posted on 2003-01-12 09:45:26 by Ratch
Ratch,

"..but the search list needs to be sorted for that to work."
It is the price of speed

"I don't know if that will be faster,.."
Just start scasd.exe

"The SCASD and TEST-JE methods can arrange the most probable numbers at the beginning of the list to insure a quicker find."
It is sorting too

"The code you submitted appears to be a binary search scheme"
"And for a short list, the overhead of computing the next search area might overwhelm the savings of a directed search"
That's right

Regards,
Lingo
Posted on 2003-01-12 13:23:37 by lingo12
lingo12,
I think you are on to something here. The binary method was usually significantly faster than either SCASD or TEST-JE. Even when the number was at the beginning of the table, which favors the SCASD and TEST-JE methods, the binary method consistently posted a timing shorter than or equal to the other two. I am going to run some more tests, and if it performs as well, I will probably go binary for my WndProc. It will be a while though. Ratch
Posted on 2003-01-12 14:46:35 by Ratch
The number of "case" statements affects the choice (of the best "switch" algorithm) a lot.
So I can advice you to test for 100 case values *only* if in your real application you will use 100 case values.
It they will only be 10, for example, the test will be pretty useless.
But you probably++ already knew that anyway.
Posted on 2003-01-12 15:59:58 by Maverick
I wrote up a test piece to compare .IF block branching against the array of addresses technique I posted recently, and found the array technique to be an improvement over .IF blocks, but not by all that much.

The test piece uses a test string of 26 different characters (the lower-case alphabet), looped 500000 times.

On this PIV the .IF block is about 20% slower, so even though the array of addresses technique has a lower instruction count for each branch, it appears that the branch prediction problems inherent in any wide range of branching are the overriding factor in branching speed.

I got the .IF block to go a bit faster by putting the return jump in the block for each option, even though it leaves a dead jump after it in each instance. A range of 26 is enough to show that an array of addresses is faster; with a much larger range, I imagine the speed difference would be larger still.

Regards,

hutch@movsd.com

LATER: Now the plot thickens. I increased the number of different characters to 52 (upper & lower case) and the array method became a lot faster, while the .IF block became slower by roughly the increase in size. On my PIV the array method is about 4 times faster and on my old PIII it is nearly 3 times faster. Just to make sure I did not mess anything up, I added a counter to both, and it seems to be working properly.

I have changed the file to the new version with the extra characters in it.
Posted on 2003-01-12 20:20:03 by hutch--
Hi Everyone,
I tested the three methods some more, and here is what I found. First of all, after completing and displaying the results, my test program writes zeros over its own code and data areas before termination, so as to force a complete reload of the cache when the program starts again. That way I am assured that each run starts from the same handicap. Maybe it is not necessary to do that, but who knows?

I always tested each method in a separate execution of the test program - no mixing of methods within a test run. I found that the SCASD method was the fastest when the program executed cold code, with no cache warming applied. Next fastest was BINARY, which was almost as fast as SCASD when SCASD had to search to the end of the table. CMP-JE did abysmally on cold code, especially on long table searches.

Next I tested the methods using 2 or more passes through the code before the times were measured. WOWIE, what a difference one or more warm-up passes to establish a cache presence makes. BINARY was fastest, unless one of the other two methods found the number early in its search. It appears that SCASD was last in speed and benefited the least from cache presence.

So to summarize: if you are going to do a one-shot search of a table, use SCASD. If you think your code and table are going to be in the cache for another go-around, use the BINARY or CMP-JE method. The BINARY method appears to be the best all-around method for sorted tables; its power-of-two search algo is difficult to match for long tables. Ratch
Posted on 2003-01-16 20:11:58 by Ratch
Ratch,

"...for my WndProc.."
I'm curious what happens with your WndProc code

Regards,
Lingo
Posted on 2003-01-16 20:26:27 by lingo12