Hi, my old variant of Boyer-Moore CaseSensitiveNext search algo is smaller and runs faster than other posted here on my old Pentium200 MMX. I use my own test progi based on rdtsc instruction and MASM 6.15.8803. Now I just add startpos ?? for testing purposes only. I don't need it in my programs... .data ALIGN 4 l_table dd 256 Dup(0) .code BMCaseSNext proc startpos: DWORD, pBuffer: DWORD, lenBuffer: DWORD, pSrchData: DWORD, lenSrchData: DWORD push edi mov ecx, lenSrchData push esi lea edi, l_table push ebx mov esi, pBuffer xor edx, edx mov eax, startpos add esi, eax sub edx, ecx mov ebx, pSrchData mov eax, lenBuffer push ebp lea ebx, lea ebp, mov eax, 256 lea esi, mov ecx, edx Lens: mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx mov , edx sub eax, 16 jne Lens DifLen: mov al, inc ecx mov , ecx jne DifLen Search: mov al, mov ecx, add ebp, ecx jl Not_found sub esi, ecx test ecx, ecx mov ecx, edx jne Search push ebx push esi Compare: inc ecx jge Found mov al, dec ebx mov ah, dec esi cmp al, ah je Compare pop esi pop ebx inc esi xor eax,eax dec ebp jge Search Not_found: mov eax,-1 jmp Exit Found: pop esi pop ebx lea eax, Exit: pop ebp pop ebx pop esi pop edi ret 20 BMCaseSNext endp
Posted on 2001-05-24 03:19:00 by buliaNaza
buliaNaza, Compliments on the Boyer Moore algorithm you have posted. It shows that there is a range of difference depending on the hardware that runs the algorithm type. I put it as you posted it into the test bench that I used on the other two and tested it on both the PIII 600 I use and the AMD 550 that I use as a backup box and the times are as follows. I used a 52 meg file loaded into memory and then clocked it in real time in milliseconds. The version I posted originally is over 20% faster on the PIII but about 2 - 3% slower on the AMD 550.

              PIII            AMD 550
              ~~~~            ~~~~~~~
BMCaseSNext   295 ms          195 ms
BMBinSearch   230 ms          200 ms
I am sure many people would be pleased if you commented the code so it was a lot easier to read and so they could understand it. Regards & Compliments, hutch@pbq.com.au This message was edited by hutch--, on 5/24/2001 4:29:45 AM
Posted on 2001-05-24 04:28:00 by hutch--
Thanks hutch--, that means that I don't need a new technology and I will save a lot of money from buying a new computer...hah..hah..hah.. some raw statistics: 01-02 ; progi ..... 08 10 ; cond&uncond jumps 00 06 ; not pairable 01 10 ; AGI stalls 00 01 ; using variables in loops ..... .....
Posted on 2001-05-24 15:52:00 by buliaNaza
buliaNaza, I share the humour, I enjoyed coding on my old p166 but the disk died and I had to build a new one. I am interested in the toys you have used to produce the statistics as I know there is one bad stall in the version that I posted with the table access and a few other problems with dependency chains. When developing it, I had a range of variations that were smaller and had shorter instruction paths but they were no faster across both machines. Regards, hutch@pbq.com.au Its been a while since I played with Vtune and I generally found it less than useful
Posted on 2001-05-24 20:15:00 by hutch--
Thanks hutch-- for your time, I continue with my row statistics... Please delete this ugly&commented xor eax,eax ; zero EAX from MainSearchLoop in your progi..You are a guru and don't need it..
Posted on 2001-05-24 20:25:00 by buliaNaza
buliaNaza, You will laugh at this but by removing the xor eax, eax from the main comparison loop it becomes about 5% faster on the AMD but about 50% slower on the Intel. I put it in there for pairing purposes. It also allowed me to use EAX in some of the variations in the shift calculation. I have a number of variations that are smaller, have far less register dependency but are no faster, all of my testing shows the access to the shift table is the bottleneck and none of the variations to date have made it any faster. Regards, hutch@pbq.com.au
Posted on 2001-05-24 23:49:00 by hutch--
Nice work, maestros. Maybe it would be worthwhile to post a link to some academic site concerned with algo design, if there is any. (I don't mean design for any particular hardware.) It's an interesting and important mathematical subject.
Posted on 2001-05-29 14:17:00 by Larry Hammick
Interesting reading on string searches. Dictionary of Algorithms, Data Structures, and Problems. EXACT STRING MATCHING ALGORITHMS. The core of programming is algorithm development - there are sites everywhere with examples of different algorithms. I don't know of any that discuss algorithm development in a general sense - I have learnt by example and developed my own philosophy. Maybe artificial intelligents research sites would have a more general text on the development of algorithm development. :) Or maybe I don't understand what your looking for?
Posted on 2001-05-29 16:33:00 by bitRAKE
Thx bitRAKE for those links, good ones. I was curious in a general way about how much is known, mathematically, about algo and data- structure efficiency.
Posted on 2001-05-29 18:40:00 by Larry Hammick
disease, I test a running block of code in real time as it reasonably well duplicates the condition that it will be run with in a computer. With string data or similar byte sequence testing I build a test file long enough to get over 2 or 3 hundred milliseconds duration and then clock it in millisecond resolution. I prefer to use a sample big enough to get over a half a second as it further reduces the percentage error. I use a couple of macros from MASM32 to do the timing and they are very simple ones so if you run the code generator Prostart and build a test file, they are in the macro file for the project. The advantage of using this technique is that it runs in ring3 access without the need for special test beds or setups. Regards, hutch@pbq.com.au
Posted on 2001-05-30 11:00:00 by hutch--
Larry Hammick, Don Knuth has a nice set of books (The Art of Computer Programming, Volumes 1-3 Boxed Set) with mathematical analysis of some algorithms. I have another book which I've been slowly reading: How to Solve It : Modern Heuristics. Yes, all my money goes to books, and then computer hardware. :) This message was edited by bitRAKE, on 5/30/2001 1:32:52 PM
Posted on 2001-05-30 13:31:00 by bitRAKE
I know both books. Polya is (was?) one of world's foremost mathematicians and his book on heuristic is unique. I have a math background and I occasionally relapse into purely mathematical headspaces. I wondered if the mathematicians had made progress recently on the abstract notions of data structures and algorithms. E.g. is there some definite measure of "compressibility" of data? How many comparisons (suitably defined) are needed by (eg) Boyer-Moore in the limit of large string tables? Is there some symmetry between algos and the data structures for which they are suitable (for if, say, the data is really huge but of simple structure, we would do better to look for a smarter data structure than to look for a smarter algo). ... That sort of thing ...
Posted on 2001-06-02 02:02:00 by Larry Hammick
Larry, A maths background will be very useful to you in algorithm design, a large amount of the available data is in mathematical notation and it assumes that the reader has it as well. I come from a logic background a long time ago and I approach an algorithm from a different perspective which sometimes works for me and sometimes does not. Among the things to be careful about is the translation of algorithm theory to a working implementation. Assembler regularly breaks the rules because it does not directly fit the mathematical model for some algorithm design theory. A good case in point is a traditional byte scanner and a Boyer Moore algo for finding a pattern in a larger text, the BM algo has better logic which makes it faster in some circumstances. The problem with exact pattern matching theory is the assumption of constants such as the time it takes to compare 2 characters. The model works in part if everyone uses portable ANSI C but when you start coding the variations in assembler, the rules change. A forward byte scanner is by no means slow when written correctly and it always compares more characters than are in the source text. A "sublinear" algorithm can be slower by a lot if the implementation has high overhead. General drift is always treat design theory with a generous amount of practical implementation to get a good result. Regards, hutch@pbq.com.au
Posted on 2001-06-02 04:42:00 by hutch--
Hi Hutch, I hate to have in my ASSEMBLY (not C/C++/Pascal, etc) code additional rubbish as follow: - push argument1, push argument2, call proc and next pop argument2, pop argument1 etc. I prefer to send the arguments by registers. - local variables - save esp in other register and can't use it - .DO, .WHILE, .IF, invoke, and other HL stuff because I can't control my code in a low level - preserve Win dependent registers because I use this one and work with MainMessages only!

MainMessages    DD      WM_COMMAND,OnCommand 
                DD      WM_CREATE,OnCreate 
                DD      WM_DESTROY,OnDestroy 
                DD      WM_PAINT, OnPaint 
nMessages       DD      $-MainMessages 
WndProc   PROC       
          mov   eax, nMessages  
          mov   ecx, offset MainMessages  
          shr   eax, 3       ;get number of messages 
          mov   edx,  ;get uMsg in edx 
          dec   eax            
          jl    DefWindowProc   ;message not found 
          cmp   ,edx ;is it the correct message? 
          jnz   WndProc_1       ;no 
          push  ESP             ;save registers as required by Windows
          push  EBP      
          push  EBX      
          push  EDI      
          push  ESI      
          call  dword ptr   

;call correct procedure for the message 
;now =hwnd,=umsg,
           pop   ESI    
           pop   EDI    
           pop   EBX    
           pop   EBP    
           pop   ESP    
           jc    DefWindowProc ;nc=don't call DefWindowProcA eax=exit code 
           pop   edx    
           pop   ecx    
           pop   ebx    
           pop   ecx    
           pop   ebx    
           push  edx    
           ret         ;ret 16 
WndProc   ENDP         ; 
So, here is my original BM algo, a little bit complicated but may be faster in your computer...

.model	flat,	stdcall	 
BMCaseSNext	proto	  	 
ALIGN 4	  	  	 
l_table	DD	256 Dup(?)	; working table
;esi ->pBuffer     ; esi->buffer with bytes to be searched through
;ebp = lenBuffer   ; ebp = length of the buffer
;ebx ->pSrchData   ; ebx->pointer to data to be searched for
;ecx = lenSrchData ; ecx= length of data  to be searched for
;edi ->pl_table    ; edi->pointer to working table (must be aligned)
;call	BMCaseSNext
BMCaseSNext	proc 
 	add	esi, ebp         ;esi->pointer to the last byte of pBuffer
 	lea	ebx,  ;ebx-> pointer to the last byte of pSrchData
 	neg	ecx              ;ecx= - lenSrchData 
 	mov	edx, ecx         ;ecx = edx = - lenSrchData
 	add	ebp, ecx         ;sub lenSrchData from lenBuffer
 	mov	eax, 256         ;eax = counter
 	xor	ebp, -1          ;not ebp-> current negative  index 
 	mov	,  edx ; filling up 256 DD  l_table with   - lenSrchData  
 	mov	,  edx 
 	mov	, edx   
 	mov	, edx
 	mov	, edx
 	mov	, edx
 	mov	, edx    
 	mov	, edx  
 	mov	, edx
 	mov	, edx
 	mov	, edx
 	mov	, edx
 	mov	, edx
 	mov	, edx
 	mov	, edx
 	mov	, edx
 	sub	eax, 16	 
 	jne	Lens	

 	mov	al,     ; filling up with the real negative offset of every
 	inc	ecx                ; byte from the pSrchData, starting from the last to
 	mov	, ecx   ; the first, at the offset in l_table equal to the
 	jne	DifLen             ; ASCII code of the byte, multiplied by 4

Search:                      ; the main searching loop
 	mov	al,       ; get a byte  from pBuffer ->esi 
 	sub	ebp,    ; sub negative
Posted on 2001-06-17 16:25:00 by buliaNaza
buliaNaza, Thanks for making this code available and particularly for commenting the code so everyone knows how it works. Complex algorithms like this are appreciated by many programmers who are interested in writing advanced code. I am a bit busy at the moment and I don't have time to set it up to benchmark the code but I am sure that there are a number of people who are members of the forum who would be interested in helping to test the code on different machines. I like the idea of putting the shift table within the actual code as it makes the algorithm self contained so that it can be used as an object. Regards, hutch@pbq.com.au
Posted on 2001-06-17 20:04:00 by hutch--