I played with buliaNaza's and Hutch's BM algos and also
with cresta's BM algo (link here):

I implemented some new ideas in faster BM algo so please test it
and indicate your CPU too

A. If we work without End of Buffer Control:
- it will be faster
- we will free one register
- and 1 function parameter less

If we move the second buffer with SearchPattern immediately after the big Buffer, our proc will find ALWAYS the SearchPattern,
hence we don't need the End of Buffer Control code.

For example we decide to read external file in the big Buffer and we allocate 4096 bytes for it and 255 bytes long SearchPattern
buffer and 1 byte more for my program too:

invoke GlobalAlloc, GMEM_ZEROINIT, 4096+255+1
mov? hmem, eax
and later
mov? ecx, hmem ? ? ?; ecx->lpBuffer1
lea? edx, ; edx->address of the? SearchPattern buffer
? ? ? ? invoke? BMLin12, ecx, edx, 34 ; where? 34 < 255 -> length of? the SearchPattern

Our code will be faster too:

B. If we work with dword comparing code only (without byte by byte comparing code)

C. If we try to implement a mechanism (without additional code)? to handle the occurrence of most repeated sequences of 4 bytes (first and last searching dwords at the beginning and the end) of the Search Pattern

Example: If our SearchPattern is 1234567890 and if we have in the big Buffer:? ....12347756340,....12346857380, .... 1234567890 and:
cmp , ebp? ? ?; ebp->first dword of the SrchPattern = 1234
jne @@1_1
we will have THREE comparisons in our case if we don?t switch to other code:
cmp ,ebx? ; ebx->last dword of the SrchPattern = 7890
jne @@2_1
if we switch to it after the first comparison (because the last four bytes are not 7890) we will have total just TWO comparisons

Here is the result with my P4 3.6 GHz Prescott

Timing routines by MichaelW? - www.masmforum.com
Please terminate any high-priority tasks and press ENTER to begin.

Boyer - Moore Tests:

BMBinS(BM Hutch -> Begining of the Buffer): 1731 clocks; Return value: 111
BMFJT (BM Cresta-> Begining of the Buffer): 1453 clocks; Return value: 111
BMLin (BM Lingo -> Begining of the Buffer): 987 clocks; Return value: 111
BMBinS(BM Hutch -> Middle of the Buffer): 48214 clocks; Return value: 19384
BMFJT (BM Cresta-> Middle of the Buffer): 36940 clocks; Return value: 19384
BMLin (BM Lingo -> Middle of the Buffer): 21571 clocks; Return value: 19384
BMBinS(BM Hutch -> End of the Buffer): 67942 clocks; Return value: 29621
BMFJT (BM Cresta-> End of the Buffer): 43276 clocks; Return value: 29621
BMLin (BM Lingo -> End of the Buffer): 27869 clocks; Return value: 29621
BMBinS(BM Hutch -> Not found): 94338 clocks; Return value: -1
BMFJT (BM Cresta-> Not found): 62079 clocks; Return value: -1
BMLin (BM Lingo -> Not found): 35697 clocks; Return value: -1

Press ENTER to exit...


Posted on 2005-07-18 19:13:28 by lingo12
For a 16byte search string or less you can do a very easy SSE implementation

Pad search string with 0's and align it on 16bytes

BMSSE:  ; search string Len <= 16
mov eax, ; data addr
mov ecx, ; data len
mov edx, ; search string addr
movdqa xmm1, dqword
mov edx, ; search string Len ; movzx edx, byte ; opt ??
and edx, 1111b  ;;; and dl if use movzx ;; 
jz .NoMatch
sub ecx,17  ; loop will start searching from last 16bytes of data
push ebx
DEC ecx
js .NoMatch
movdqu xmm0,dqword
pcmpeqb xmm0,xmm1
pmovmskb ebx,xmm0
and bx,word
cmp bx,word  ;;; sub bx,word ;; faster than cmp, nah??
jne .LP
mov eax,ecx ; offset of the match, switch EAX and ECX in above code to remove this line
pop ebx
retn 16
mov eax,-1
pop ebx
retn 16

AndTable dw 0, 1b, 11b, 111b, 1111b, 11111b, 111111b, 1111111b, 11111111b, 111111111b, 1111111111b, 11111111111b, 111111111111b, 1111111111111b, 11111111111111b, 111111111111111b, 1111111111111111b

Having to use the movdqU  might slow it down a bit, but I'd bet it's still faster than a non SSE implementation.
I wrote the above code in the post, it's not tested.

Posted on 2005-07-19 00:04:12 by r22
Thank you r22,? :)

"...but I'd bet it's still faster than a non SSE implementation."

You don't use any specialized algo and I'm wondering? why do you think so?? :shock:

Posted on 2005-07-19 11:57:41 by lingo12
Why I think it would be faster.
In general SSE instructions run faster, that's why.
Posted on 2005-07-19 14:58:29 by r22
Thanks again r22, :lol:

"In general SSE instructions run faster, that's why."

Really?? :shock:
O boy, you opened my eyes and now I know
why my code is so fast...? :D
According to you, it is because I use
faster SSE instructions rather than the Boyer-Moore algo in it, isn't it?


Posted on 2005-07-19 16:54:00 by lingo12