Hello everyone - first time poster, long time lurker :) You guys started me off asm'ing and hopefully will help me here!
I love the Boyer-Moore search algorithm - and originally found it on Jeremy Collake's site - http://www.collakesoftware.com, but I was wondering if it were to be possible to write an extension of this algorithm that would be able to search the same file with many strings (that aren't necessarily the same length). Right now, my file searcher program (making a Windows file searcher - can search thru the whole system for strings in any kind of file) takes out all of the binary junk from files (non-ascii chars) and searches each file for hundreds of words (specified by the user). It seems incredibly pointless to have to keep generating the table (some of the files that it searches are 10 megs in size). I cant figure this out - but maybe someone could: How could you search thru the same file for hundreds of strings faster than having to re-call the proc each time?
Thanks for your time everyone! (and I won't say thanks in advance cause that really ticks me off when people say that to me :D)
-GNUru
Posted on 2004-09-29 18:45:21 by GNUru
GNUru,

The basic design of the BM family of pattern matching algorithms is setting up the tables for the word/phrase/byte sequence being searched for. There are other designs but with any reasonable length pattern, the BM algos will eat the rest alive.

If the source being searched is over about 1k and the pattern is over 6 characters, the advantage starts to show. The longer the search pattern, the faster the search is. If you can set it up that way, always look for the longest pattern rather than a subset of it as it will increase the speed by a lot.

There are 3 seperate BM search algos in the MASM32 library that were based directly off Bob Boyers original research work. One of the members here also published a BM algo in the APJ under the name of BuliaNaza.

With Jeremy Collake's version, this gives you 5 to play with.
Posted on 2004-09-29 23:43:17 by hutch--
While the Boyer-Moore algorithm (and it's many variants) is a very good choice for searching for a single string, it is not always the best choice for searching for any of a set of strings.

If you have a moderately large set of search patterns, it's often worth the extra preprocessing of a more advanced data structure like e.g. in the Aho-Corasick algorithm :alright:.
Posted on 2004-09-30 02:49:37 by Jibz
GNUru, as a first step it's reasonably straightforward to split an existing BM algorithm in two parts: table-gen and the actual search. This probably won't save you that much time though - if you have hundreds of search strings, consider Jibz' advice.
Posted on 2004-09-30 03:20:23 by f0dder
Thanks for the responses! :)
I have been searching around on google for a while, and can only find a good implementation of the Aho-Corasick search algorithm in ClamAV Antivirus. (And that is in C++). Am I going to have to make a C++ dll and call it from asm or is there any asm code out there that I cant find?
See ya!
-GNUru
Posted on 2004-09-30 06:24:33 by GNUru
GNUru, :-D
I agree with f0dder here:
"as a first step it's reasonably straightforward to split an existing BM algorithm in two parts: table-gen and the actual search."


and disagree with:

"This probably won't save you that much time though - if you have hundreds of search strings, consider Jibz' advice."


Here is my idea: :wink:
- If we have 200 search strings, we need to initialize once 200 skip tables with 256 dwords each
- next we need to start once a program which will generate in memory another speed optimized program
(search part of BM) as the followng example
- we will execute speed optimized program (search part of BM) when we read every file in a the buffer

Example: (search part of BM)


; 403120 B87A563412 mov eax,1234567Ah ;eax->end of the buffer
; 403125 EB09 jmp 403130
;BeginOR1: ; Search 1st string = "654321" with length 6
; 403130 8B0B mov ecx,0EF123456h ;ecx->current negative offset in the buffer
; 403140 0FB61408 movzx edx,byte ptr [eax+ecx] ;edx->next byte from the buffer
; 403144 2B0C957C563412 sub ecx,dword ptr [edx*4+1234567Ch] ;1234567Ch->SkipTable 1
; 40314B 72F3 jb 403140
; 40314D 83C101 add ecx, 1
; 403150 7F2E jg 403180 ;Not found->Go to next string->BeginOR2:
; 403152 807C08FA36 cmp byte ptr [eax+ecx-6],36h ;1st byte of the 1st string
; 403157 75E7 jne 403140
; 403159 807C08FB35 cmp byte ptr [eax+ecx-5],35h ;2nd byte of the 1st string
; 40315E 75E0 jne 403140
; 403160 807C08FC34 cmp byte ptr[eax+ecx-4],34h ;3rd byte of the 1st string
; 403165 75D9 jne 403140
; 403167 807C08FD33 cmp byte ptr [eax+ecx-3],33h ;4th byte of the 1st string
; 40316C 75D2 jne 403140
; 40316E 807C08FE32 cmp byte ptr [eax+ecx-2],32h ;5th byte of the 1st string
; 403173 75CB jne 403140
; 403178 EB56 jmp 4031D0 ;Label Found
;BeginOR2: ;Search 2nd string="97821" with length 5
; 403180 8B0B mov ecx,0EF453456h ;ecx->current negative offset in the buffer
; 403190 0FB61408 movzx edx,byte ptr[eax+ecx] ;edx->next byte from the buffer
; 403194 2B0C957D563412 sub ecx,dword ptr[edx*4+1234567Dh] ;1234567Dh->Skip Table2
; 40319B 72F3 jb 403190
; 40319D 83C101 add ecx,1
; 4031A0 7F1E jg 4031C0 ;4031C0->Label NotFound
; 4031A2 807C08FB39 cmp byte ptr [eax+ecx-5],39h ;1st byte of the 2nd string
; 4031A7 75E7 jne 403190
; 4031A9 807C08FC37 cmp byte ptr [eax+ecx-4],37h ;2nd byte of the 2nd string
; 4031AE 75E0 jne 403190
; 4031B0 807C08FD38 cmp byte ptr [eax+ecx-3],38h ;3rd byte of the 2nd string
; 4031B5 75D9 jne 403190
; 4031B7 807C08FE32 cmp byte ptr [eax+ecx-2],32h ;4th byte of the 2nd string
; 4031BC 75D2 jne 403190
; 4031BE EB10 jmp 4031D0 ;Label Found
;....
;.... ; Search 200th string = "......."
;BeginOR200:


Note:
mov eax,1234567Ah ; eax->end of the buffer
mov ecx, 0EF123456h ; ecx-> current negative offset in the buffer

so when we start the search:

(eax+ecx) = address of the buffer + 5 for the 1st string with length 6
(eax+ecx) = address of the buffer + 4 for the 2nd string with length 5

Regards,
Lingo
Posted on 2004-10-01 00:41:31 by lingo12
Lingo's idea is an interesting one although it is not a simple task to initialise so many character tables but it needs to be kept in mind that the cycle count of initialising a 256 member table is relatively low and the data is cache local once this is done so you have no other table loading overheads once it is initialised in a conventional BM implimentation.

Now repeatedly scanning a file that is already in memory is far faster than the first time it is scanned and back when I worked on this algo design on a PIII long patterns (128 characters and up) were scanning at over 1 gig a second when testing on a large word or data count against the source.

Its well known that no single search algo will do everything well so knowing something about the data being scanned is necessary. With short patterns (< 6 characters) a linear scanner is probably very hard to beat as they still only branch test on a 1st character match and the rest is direct byte scan speed.

If you do go in the direction of a BM algo, try for a design that has the full heuristics designed by Bob Boyer and not just a cut down version as the full design as it is a lot more flexible in the source data it can handle than either of the main variations.

Here is the page I put up a few years ago to explain how the basics of a BM algo work.

http://www.movsd.com/bm.htm

I don't know much about the design that Jibz mentioned but the little I could find on it says it scans in linear time where a BM beats linear time in almost all instances except quadratic ones where the data and pattern size don't suit the design.
Posted on 2004-10-01 05:42:28 by hutch--
Hm, I "think" I understand that example code. One (or more) question -
I just need to build the table for each string, and then replace edi with (if I set up an array of DWORDS) (in the collake example - it is a 256 byte structure - a DWORD would be able to point to that correct?)

Here is the original function if you do not know the original one I am working with.
If someone has a faster version - or a simpler one - :) - Pls send :D


BMTABLE STRUCT
dd 256 dup(0)
BMTABLE ENDS
RevStrCmp MACRO
LOCAL RevStrCmpLoop,RevStrCmpEnd
dec ecx
jz RevStrCmpEnd
RevStrCmpLoop:
dec esi
dec ebx
mov al,byte ptr [esi]
cmp al,byte ptr [ebx]
jnz RevStrCmpEnd
dec ecx
jnz RevStrCmpLoop
RevStrCmpEnd:
ENDM
BMBinBin proc uses ebx edi esi pData1:DWORD, Data1Len:DWORD,
pData2:DWORD, Data2Len:DWORD

LOCAL EndOfData:DWORD
LOCAL bm_table:BMTABLE

mov eax,pData1
add eax,Data1Len
mov EndOfData,eax

lea edi,bm_table
push edi
mov eax,Data2Len
mov ecx,256
stosd_loop:
mov [edi],eax
add edi,4
dec ecx
jnz stosd_loop
pop edi

mov ecx,eax
push eax
dec ecx
mov esi,pData2
push edi
xor eax,eax
FillTableLoop:
mov al,byte ptr [esi]
inc esi
mov dword ptr [edi+(eax*4)],ecx
dec ecx
jns FillTableLoop

pop edi ; edi->bm_table
pop ecx ; ecx=size of string
mov esi,pData1
mov ebx,pData2
dec ecx
add ebx,ecx
add esi,ecx
mov edx,EndOfData


MainSearchLoop:
xor eax,eax
mov al,byte ptr [esi]
mov ecx,dword ptr [edi+(eax*4)]
test ecx,ecx
jz strcmp
add esi,ecx
NextIteration:
cmp esi,edx
jbe MainSearchLoop
jmp ReturnNotFound

strcmp:
push ebx
push esi
mov ecx,Data2Len
RevStrCmp
jz ReturnFound
pop esi
pop ebx
inc esi
jmp NextIteration

ReturnFound:
pop ebx
mov eax,esi
pop esi
ret

ReturnNotFound:
xor eax,eax
ret
BMBinBin endp

-GNUru
Posted on 2004-10-01 05:47:01 by GNUru
Didn't see your new post hutch :) Maybe I should load multiple files at a time - until I get up to, say, 20 megs, then I can search them all - and based on the position of the offset of the string, determine which file it is in. (Could speed it up...)
-GNUru
Posted on 2004-10-01 05:49:21 by GNUru
Eh, scratch the multiple files idea - it seemed good on paper, but keeping track of 500 some strings per file would be nuts.
Was also wondering - am I going to have to GlobalAlloc memory for each string and save the return of GA to an array of dwords? I am reading all of these strings from an external file.
Or..
Do I save an array of the actual strings (they are all less than 100 chars long - not many words are more than 100 chars :D) into a large buffer (100*500 or something), and address the actual strings from the buffer?
This may be easier to manage.
Aho-Corsaik - from the example I found - is slower than Boyer Moore...
-GNUru
Posted on 2004-10-01 06:38:02 by GNUru

I don't know much about the design that Jibz mentioned but the little I could find on it says it scans in linear time where a BM beats linear time in almost all instances except quadratic ones where the data and pattern size don't suit the design.

Yes, the Aho-Corasick algorithm is linear, and BM is (potentially) sub-linear. However AC is linear no matter how many strings you fill into it, while BM is a sub-linear application for each string.

Of course it requires a certain volume of search patterns before it's worth the extra complexity of building the more advanced data structure, so it depends on the problem at hand which is the best choice :).
Posted on 2004-10-01 07:24:07 by Jibz
Jibz,

It actually sounds like an interesting capacity in a search algo, I know basically how to do large word list searches using a hash table and you use the same technique for doing large count word relacements but it suffers the hash table build time and it must scan the source in a linear manner to recognise each word.

With the BM implimentations, as long as the data is not ordered in a very unusual way, average case times on patterns long enough (> 6 characters) become very sublinear with the increase in pattern length so in the majority of cases it is a lot faster than a linear search technique. Also the first scan is the slowest, repeat scans on a source already in memory are much faster for basically hardware reasons.

Another factor I have learnt with search algos is how fast they are rather than the linearity theory as you can have sublinear algos that are just plain slow if the overhead to deliver the sublinearity is high. With a conventional byte scanner, the recovery time from mismatch effects the overall performance a lot so you try for low overhead comparison branches.

A BM is a more complex algo in its operation but the jump from each mismatch is a lot faster than simply scanning the source and it picks up speed by the number of characters it does not scan.

The new design sounds interesting but I have not seen enough technical data on it to know what it does yet.
Posted on 2004-10-01 11:25:16 by hutch--
GAH - guys - can anyone help me with seperating the Boyer Moore function? Everything I try fails miserably :) Hutch - Im sure that this would be a good feature in the Masm32Lib; 'AddSearchString', and 'SearchBufferForStrings'.
-GNUru
Posted on 2004-10-01 20:24:26 by GNUru
I realize this is an old thread, but I figured better late than never. Here is my implementation of the simplified BM algorithm:


; The Boyer-Moore table generator
bmTable proc uses esi edi pBytes:DWORD, byteLen:DWORD, pTable:DWORD

mov eax, byteLen
mov ebx, pTable
mov ecx, 256
LEN_FILL_LOOP:
mov DWORD PTR [ebx], eax
add ebx, 4
sub ecx, 1
jnz LEN_FILL_LOOP

mov ecx, 0
mov edx, byteLen
sub edx, 1
JMP_FILL_LOOP:
mov ebx, byteLen
sub ebx, ecx
sub ebx, 1

mov eax, pBytes
add eax, ecx
movzx eax, BYTE PTR [eax]
shl eax, 2
add eax, pTable
mov DWORD PTR [eax], ebx

add ecx, 1
cmp ecx, edx
jl JMP_FILL_LOOP

ret
bmTable endp



; The Boyer-Moore search algorithm
bmSearch proc uses esi edi pMemory:DWORD, memLen:DWORD, pBytes:DWORD, byteLen:DWORD, pTable:DWORD
LOCAL memEnd:DWORD, byteEnd:DWORD

mov eax, pMemory
add eax, memLen
sub eax, 1
mov memEnd, eax
mov eax, pBytes
add eax, byteLen
sub eax, 1
mov byteEnd, eax

mov esi, pMemory
add esi, byteLen

SEARCH_LOOP:
push esi
mov edi, byteEnd

SEARCH_NEXT_BIT:
sub esi, 1
movzx eax, BYTE PTR [esi]
movzx ebx, BYTE PTR [edi]

cmp eax, ebx
jne SEARCH_DONE ; bytes don't match
sub edi, 1
cmp edi, pBytes
jge SEARCH_NEXT_BIT ; haven't reached the beginning of pBytes

sub esi, pMemory
mov eax, esi
pop esi
ret

SEARCH_DONE:

shl eax, 2
add eax, pTable
mov eax, DWORD PTR [eax]
pop esi
add esi, eax
cmp esi, memEnd
jl SEARCH_LOOP

or eax, 0FFFFFFFFh
ret
bmSearch endp


These function assume that your alphabet is the ASCII alphabet [0,FFh] and that the length of the search string is <= FFFFFFFFh bytes. The search function returns the offset to the beginning of the search string from the beginning of the memory block to be searched or FFFFFFFFh if it is not found.

My table is usualy declared as


searchTable DWORD 256 DUP(?)

but if you know you're only going to be searching for keyboard charcters then it can be quite a bit smaller and still work.

I hope that helps someone.

Spara
Posted on 2004-11-22 17:19:46 by Sparafusile
Thanks for the algo :)
Posted on 2004-12-01 19:24:42 by GNUru