Hi, my old variant of Boyer-Moore CaseSensitiveNext search algo is smaller and runs faster than other posted here on my old Pentium200 MMX. I use my own test progi based on rdtsc instruction and MASM 6.15.8803. Now I just add startpos ?? for testing purposes only. I don't need it in my programs... .data ALIGN 4 l_table dd 256 Dup(0) .code BMCaseSNext proc startpos: DWORD, pBuffer: DWORD, lenBuffer: DWORD, pSrchData: DWORD, lenSrchData: DWORD push edi mov ecx, lenSrchData push esi lea edi, l_table push ebx mov esi, pBuffer xor edx, edx mov eax, startpos add esi, eax sub edx, ecx mov ebx, pSrchData mov eax, lenBuffer push ebp lea ebx, lea ebp, mov eax, 256 lea esi, mov ecx, edx Lens: mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx sub eax, 16 jne Lens DifLen: mov al, inc ecx mov , ecx jne DifLen Search: mov al, mov ecx, add ebp, ecx jl Not_found sub esi, ecx test ecx, ecx mov ecx, edx jne Search push ebx push esi Compare: inc ecx jge Found mov al, dec ebx mov ah, dec esi cmp al, ah je Compare pop esi pop ebx inc esi xor eax,eax dec ebp jge Search Not_found: mov eax,-1 jmp Exit Found: pop esi pop ebx lea eax, Exit: pop ebp pop ebx pop esi pop edi ret 20 BMCaseSNext endp
buliaNaza, Compliments on the Boyer Moore algorithm you have posted. It shows that there is a range of difference depending on the hardware that runs the algorithm type. I put it as you posted it into the test bench that I used on the other two and tested it on both the PIII 600 I use and the AMD 550 that I use as a backup box and the times are as follows. I used a 52 meg file loaded into memory and then clocked it in real time in milliseconds. The version I posted originally is over 20% faster on the PIII but about 2 - 3% slower on the AMD 550.
I am sure many people would be pleased if you commented the code so it was a lot easier to read and so they could understand it. Regards & Compliments, email@example.com This message was edited by hutch--, on 5/24/2001 4:29:45 AM
PIII AMD 550 ~~~~ ~~~~~~~ BMCaseSNext 295 ms 195 ms BMBinSearch 230 ms 200 ms
Thanks hutch--, that means that I don't need a new technology and I will save a lot of money from buying a new computer...hah..hah..hah.. some raw statistics: 01-02 ; progi ..... 08 10 ; cond&uncond jumps 00 06 ; not pairable 01 10 ; AGI stalls 00 01 ; using variables in loops ..... .....
buliaNaza, I share the humour, I enjoyed coding on my old p166 but the disk died and I had to build a new one. I am interested in the toys you have used to produce the statistics as I know there is one bad stall in the version that I posted with the table access and a few other problems with dependency chains. When developing it, I had a range of variations that were smaller and had shorter instruction paths but they were no faster across both machines. Regards, firstname.lastname@example.org Its been a while since I played with Vtune and I generally found it less than useful
Thanks hutch-- for your time, I continue with my row statistics... Please delete this ugly&commented xor eax,eax ; zero EAX from MainSearchLoop in your progi..You are a guru and don't need it..
buliaNaza, You will laugh at this but by removing the xor eax, eax from the main comparison loop it becomes about 5% faster on the AMD but about 50% slower on the Intel. I put it in there for pairing purposes. It also allowed me to use EAX in some of the variations in the shift calculation. I have a number of variations that are smaller, have far less register dependency but are no faster, all of my testing shows the access to the shift table is the bottleneck and none of the variations to date have made it any faster. Regards, email@example.com
Nice work, maestros. Maybe it would be worthwhile to post a link to some academic site concerned with algo design, if there is any. (I don't mean design for any particular hardware.) It's an interesting and important mathematical subject.
Interesting reading on string searches. Dictionary of Algorithms, Data Structures, and Problems. EXACT STRING MATCHING ALGORITHMS. The core of programming is algorithm development - there are sites everywhere with examples of different algorithms. I don't know of any that discuss algorithm development in a general sense - I have learnt by example and developed my own philosophy. Maybe artificial intelligents research sites would have a more general text on the development of algorithm development. :) Or maybe I don't understand what your looking for?
Thx bitRAKE for those links, good ones. I was curious in a general way about how much is known, mathematically, about algo and data- structure efficiency.
disease, I test a running block of code in real time as it reasonably well duplicates the condition that it will be run with in a computer. With string data or similar byte sequence testing I build a test file long enough to get over 2 or 3 hundred milliseconds duration and then clock it in millisecond resolution. I prefer to use a sample big enough to get over a half a second as it further reduces the percentage error. I use a couple of macros from MASM32 to do the timing and they are very simple ones so if you run the code generator Prostart and build a test file, they are in the macro file for the project. The advantage of using this technique is that it runs in ring3 access without the need for special test beds or setups. Regards, firstname.lastname@example.org
Larry Hammick, Don Knuth has a nice set of books (The Art of Computer Programming, Volumes 1-3 Boxed Set) with mathematical analysis of some algorithms. I have another book which I've been slowly reading: How to Solve It : Modern Heuristics. Yes, all my money goes to books, and then computer hardware. :) This message was edited by bitRAKE, on 5/30/2001 1:32:52 PM
I know both books. Polya is (was?) one of world's foremost mathematicians and his book on heuristic is unique. I have a math background and I occasionally relapse into purely mathematical headspaces. I wondered if the mathematicians had made progress recently on the abstract notions of data structures and algorithms. E.g. is there some definite measure of "compressibility" of data? How many comparisons (suitably defined) are needed by (eg) Boyer-Moore in the limit of large string tables? Is there some symmetry between algos and the data structures for which they are suitable (for if, say, the data is really huge but of simple structure, we would do better to look for a smarter data structure than to look for a smarter algo). ... That sort of thing ...
Larry, A maths background will be very useful to you in algorithm design, a large amount of the available data is in mathematical notation and it assumes that the reader has it as well. I come from a logic background a long time ago and I approach an algorithm from a different perspective which sometimes works for me and sometimes does not. Among the things to be careful about is the translation of algorithm theory to a working implementation. Assembler regularly breaks the rules because it does not directly fit the mathematical model for some algorithm design theory. A good case in point is a traditional byte scanner and a Boyer Moore algo for finding a pattern in a larger text, the BM algo has better logic which makes it faster in some circumstances. The problem with exact pattern matching theory is the assumption of constants such as the time it takes to compare 2 characters. The model works in part if everyone uses portable ANSI C but when you start coding the variations in assembler, the rules change. A forward byte scanner is by no means slow when written correctly and it always compares more characters than are in the source text. A "sublinear" algorithm can be slower by a lot if the implementation has high overhead. General drift is always treat design theory with a generous amount of practical implementation to get a good result. Regards, email@example.com
Hi Hutch, I hate to have in my ASSEMBLY (not C/C++/Pascal, etc) code additional rubbish as follow: - push argument1, push argument2, call proc and next pop argument2, pop argument1 etc. I prefer to send the arguments by registers. - local variables - save esp in other register and can't use it - .DO, .WHILE, .IF, invoke, and other HL stuff because I can't control my code in a low level - preserve Win dependent registers because I use this one and work with MainMessages only!
So, here is my original BM algo, a little bit complicated but may be faster in your computer...
.data ALIGN 4 MainMessages DD WM_COMMAND,OnCommand DD WM_CREATE,OnCreate DD WM_DESTROY,OnDestroy DD WM_PAINT, OnPaint .... .... nMessages DD $-MainMessages .code WndProc PROC mov eax, nMessages mov ecx, offset MainMessages shr eax, 3 ;get number of messages mov edx, ;get uMsg in edx WndProc_1: dec eax jl DefWindowProc ;message not found cmp ,edx ;is it the correct message? jnz WndProc_1 ;no push ESP ;save registers as required by Windows push EBP push EBX push EDI push ESI call dword ptr ;call correct procedure for the message ;now =hwnd,=umsg, ;=wparam,=lparam pop ESI pop EDI pop EBX pop EBP pop ESP jc DefWindowProc ;nc=don't call DefWindowProcA eax=exit code pop edx pop ecx pop ebx pop ecx pop ebx push edx ret ;ret 16 WndProc ENDP ;
.386 .model flat, stdcall BMCaseSNext proto .data ALIGN 4 l_table DD 256 Dup(?) ; working table .code ;Usage: ;esi ->pBuffer ; esi->buffer with bytes to be searched through ;ebp = lenBuffer ; ebp = length of the buffer ;ebx ->pSrchData ; ebx->pointer to data to be searched for ;ecx = lenSrchData ; ecx= length of data to be searched for ;edi ->pl_table ; edi->pointer to working table (must be aligned) ;call BMCaseSNext BMCaseSNext proc add esi, ebp ;esi->pointer to the last byte of pBuffer lea ebx, ;ebx-> pointer to the last byte of pSrchData neg ecx ;ecx= - lenSrchData mov edx, ecx ;ecx = edx = - lenSrchData add ebp, ecx ;sub lenSrchData from lenBuffer mov eax, 256 ;eax = counter xor ebp, -1 ;not ebp-> current negative index Lens: mov , edx ; filling up 256 DD l_table with - lenSrchData mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx sub eax, 16 jne Lens DifLen: mov al, ; filling up with the real negative offset of every inc ecx ; byte from the pSrchData, starting from the last to mov , ecx ; the first, at the offset in l_table equal to the jne DifLen ; ASCII code of the byte, multiplied by 4 Search: ; the main searching loop mov al, ; get a byte from pBuffer ->esi Search1: sub ebp, ; sub negative
buliaNaza, Thanks for making this code available and particularly for commenting the code so everyone knows how it works. Complex algorithms like this are appreciated by many programmers who are interested in writing advanced code. I am a bit busy at the moment and I don't have time to set it up to benchmark the code but I am sure that there are a number of people who are members of the forum who would be interested in helping to test the code on different machines. I like the idea of putting the shift table within the actual code as it makes the algorithm self contained so that it can be used as an object. Regards, firstname.lastname@example.org