You can remove one instruction here:

cmp subLngth, 1
jg @F
mov eax, -2 ; string too short, must be > 1
jmp Cleanup
@@:
==========
Change it to:

cmp subLnght,1
mov eax,-2
jle Cleanup

Don't warry about putting mov eax,-2 might be for
nothing it pairs and avoids "First Jump missprediction" in
jg @F

in case of jle Cleanup - this (as I think) is least probable case -
and it means you in most cases want it mispredicted.
(In other words - predicted as "not to be taken")
Cause it's also first jump it always to be predicted as not to be taken.
The Svin.
Posted on 2001-08-09 09:14:59 by The Svin



1. mov esi, lpSource
add esi, srcLngth
1.1 sub esi, subLngth
mov ExitLen, esi ; set Exit Length

; ----------------------------------------
; load shift table with value in subLngth
; ----------------------------------------
mov ecx, 256
2. mov eax, subLngth
lea edi, shift_table
rep stosd

; ----------------------------------------------
; load decending count values into shift table
; ----------------------------------------------
3. mov ecx, subLngth ; SubString length in ECX
dec ecx ; correct for zero based index
mov esi, lpSubStr ; address of SubString in ESI
lea edi, shift_table

xor eax, eax
================================================

Let's look up above on quoted code and get some data wich can help us
to take right decision, not even nowing what the algo is abot :)
- We can see (1.) that here are chain dependences which slow down CPU
- We can see that the code use the same value stored in subLngth (1.1;2.;3.)
- We can see that the code piece change the value of eax just once (2.) and doesn't
change it until the end (xor eax,eax)
Now we are ready to make very obvious optimizations :)

mov esi,lpSource ; first step in the road esi = lpSource+srcLngth-subLngth = ExitLen
mov eax,subLngth ;eax = subLngth
add esi, srcLngth ;second step
mov ecx,256 ;for stosd we do it here to remove depen.
sub esi, eax ;third step
lea edi,shift_table ;for stosd
mov ExitLen,esi ;finish culculations fot ExitLen
rep stosd ;everything is ready for it: eax=sunLength,ecx=256,edi=addr of shift_table

=========
Now look again at the quoted code from 3. :
3. mov ecx, subLngth ; SubString length in ECX
dec ecx ; correct for zero based index
mov esi, lpSubStr ; address of SubString in ESI
lea edi, shift_table
==============
we have subLength in eax so we can do instead of:
mov ecx,subLnght
dec ecx
--------------------
just:
lea ecx,[eax-1]
=========
and We DON'T NEED the:
mov edi,shift_table
cause addr of shift_table is already in edi (was put there before stosd)
so all this piece of code now can be not even faster but also smaller (in bytes):

mov esi,lpSource ; first step in the road esi = lpSource+srcLngth-subLngth = ExitLen
mov eax,subLngth ;eax = subLngth
add esi, srcLngth ;second step
mov ecx,256 ;for stosd we do it here to remove depen.
sub esi, eax ;third step
lea edi,shift_table ;for stosd
mov ExitLen,esi ;finish culculations fot ExitLen
rep stosd ;everything is ready for it: eax=sunLength,ecx=256,edi=addr of shift_table

lea ecx,[eax-1]
mov esi, lpSubStr
xor eax,eax

Posted on 2001-08-09 10:17:03 by The Svin
Alex,

Thanks for the work, I agree that the prologue code is not optimal but in speed terms it does not matter, the real action is in the loop code and it is here that I see that gains can be made.

The fundamental logic of this type of Boyer Moore that uses the 3 heuristics is that it must have a branch depending on what the value for the character is stored in the table and this means there is an unpredictable jump that is causing branch prediction buffer delays.

From my testing, the triple heuristic version is faster on Intel based hardware than either of the simplifications I have running because the Intel processor performs better in the mispredicted jumps than the AMD I have to test with but a simplified version is slightly faster on an AMD.

When I have done a bit more work on one of the simplified versions, I will post the code for anyone to have a play with.

Regards,

hutch@pbq.com.au
Posted on 2001-08-09 20:38:20 by hutch--
Thanks for the work

You are most welcome

I agree that the prologue code is not optimal but in speed terms it does not matter, the real action is in the loop code and it is here that I see that gains can be made.


I share your concern about main loop optimization -
it's the main aim for optimization of any algo.
But, of course, minor optimizations don't do any harm.
Yes, they make things just a little bit faster or smaller
but at least they do it.
And one of goals I post some optimizations for of already well written
algos is to show some basic tecnigues and usual flows
to keep it in mind while designing any other algo.
(In this example reloading some values which are already ready
to use)

An other thing why I always choose code of some fameous and
expirience programmers as you or Iczelion for optimizations is
that I want to encourage young promgrammers that there is
always room for optimizations, the way to do things better.
Even if code is written by some guru.

That's why I'm also happy when somebody does code that beats
mine :)

Or some code that is not faster or smaller but using some different logic or approach unknown to me.
'Cause some algos which are not perfect for some particular purpose may be the right ones in other case, and knowing them
made me be better in my proffession.

Too many words :)
Sorry :)

The Svin.
Posted on 2001-08-10 01:21:30 by The Svin
I dont know what is wrong but, with Svin's optimizations, search algo crashes.


However,Huch's original one works.

I have 1711h buffer to search and I search 20h bytes.Here is Svin's modified algo.Maybe I wrongly put it.



BMBinSearch proc startpos:DWORD,
lpSource:DWORD,srcLngth:DWORD,
lpSubStr:DWORD,subLngth:DWORD
LOCAL shift_table[256]:DWORD
LOCAL cval :DWORD
LOCAL ExitLen:DWORD

push ebx
push esi
push edi
cmp subLngth,1
mov eax,-2
jle Cleanup
mov esi,lpSource ; first step in the road esi = lpSource+srcLngth-subLngth = ExitLen
mov eax,subLngth ;eax = subLngth
add esi, srcLngth ;second step
mov ecx,256 ;for stosd we do it here to remove depen.
sub esi, eax ;third step
lea edi,shift_table ;for stosd
mov ExitLen,esi ;finish culculations fot ExitLen
rep stosd ;everything is ready for it: eax=sunLength,ecx=256,edi=addr of shift_table

lea ecx,[eax-1]
mov esi, lpSubStr
xor eax,eax

Write_Shift_Chars:
mov al, [esi] ; get the character
inc esi
mov [edi+eax*4], ecx ; write shift for each character
dec ecx ; to ascii location in table
jnz Write_Shift_Chars

; -----------------------------
; set up for main compare loop
; -----------------------------
mov ecx, subLngth
dec ecx
mov cval, ecx

mov esi, lpSource
mov edi, lpSubStr
add esi, startpos ; add starting position

mov edx, ExitLen
mov ebx, subLngth

jmp Cmp_Loop

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Calc_Suffix_Shift:
add eax, ecx ; add CMP count
sub eax, cval ; sub loop count
cmp eax, 0 ; test eax for zero
jg Add_Suffix_Shift
mov eax, 1 ; minimum shift is 1

Add_Suffix_Shift:
add esi, eax ; add suffix shift

Pre_Loop:
cmp esi, edx ; ExitLen
jg No_Match
mov ecx, cval ; reset counter in compare loop
xor eax, eax ; zero EAX

Cmp_Loop:
mov al, [esi+ecx]
cmp al, [edi+ecx] ; cmp characters in ESI / EDI
jne Set_Shift ; if not equal, get next shift
dec ecx
jns Cmp_Loop

jmp Match

Set_Shift:
mov eax, shift_table[eax*4] ; get char shift value
cmp eax, ebx ; subLngth
jne Calc_Suffix_Shift ; if not, jump to Calc_Suffix_Shift
lea esi, [esi+ecx+1] ; add bad char shift
jmp Pre_Loop

; %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


Match:
sub esi, lpSource ; sub source from ESI
mov eax, esi ; put length in eax
jmp Cleanup

No_Match:
mov eax, -1

Cleanup:
pop edi
pop esi
pop ebx

ret

BMBinSearch endp
Posted on 2001-08-10 07:26:15 by LaptoniC
The thing I can think of :
change: jle Cleanup
to: jng Cleanup
For the rest both codes upto Write_Shift_Chars:
are doomed to produce the same result.
If you're sure of what your wrote could you, please
show the whole code where your call both procedures
with the same conditions to prove that one of them
would work different way given the same parameters?

The Svin.
Posted on 2001-08-12 23:49:39 by The Svin
I call it like this
invoke BMBinSearch,0,addr Magic,Magiclen,addr msum,20h

length of Magic is 1711h.
If you want I can send the source code.Also problem is not related with
The thing I can think of :
change: jle Cleanup
to: jng Cleanup


As you see sublength is 20h.
Posted on 2001-08-13 05:03:10 by LaptoniC
Yes, I want the source code.
Everything is clear through debugging.
Posted on 2001-08-13 05:08:17 by The Svin
Guys,

Its worth understanding what the two loops do before the main loop code, the "rep stosd" loop loads each of the 256 item table with the length of the pattern being searched for, the following loop writes the descending shift value for each character in the pattern into the correct position in the 256 member ascii table.

There are no speed optimisation that will make this code go faster, the real action is in the search loop code that starts with "Calc_Suffix_Shift:".

This version uses all 3 heuristics,

1. Bad character shift
2. Good Suffix Shift
3. Decremented good suffix shift with repeat characters

This is reasonably close to the design that Bob Boyer and L. Moore designed in 1977. I have 2 other variations that use only one of the two shifts, one is nearly as fast and the other is a lot slower.

In relation to the other BM algorithms I have to test against, its search speed is OK, bit faster on Intel, bit slower on AMD but it is a lot faster in mismatch recovery. This is because it has short paths in both shift types.

Where there is a problem is in the access to the table that holds the shift values. There appears to be a major stall after getting the value from the shift table and I have not found a way around it yet.

I have coded many variations to try and fix this problem as it would improve the speed very dramatically but none of them have solved the problem.

This is the code that is causing the stall as far as I can test.

Set_Shift:
mov eax, shift_table ; get char shift value

I am guessing that there is an L1 cache conflict between the data being scanned and the table that is built on the stack. I have tested it with a GLOBAL table in the .DATA section but it still has the same problem. I have variations that do not have the potential dependency chain problems but they do not test any faster.

Regards,

hutch@pbq.com.au
Posted on 2001-08-13 05:17:25 by hutch--
LaptoniC ,

You were right. The source helped me
I found the bug.
before xor eax,eax put
lea edi,shift_table
'cause rep stosd would override value of edi

The Svin.
Posted on 2001-08-13 23:19:34 by The Svin