In the most recent war with Hutch (sigh), it was suggested that my "ignorance is showing here", as I suggested using a Boyer-Moore search for log file scanning, as they are "written line by line". Enough about that war, but this got me thinking...

First, let's assume that a case-sensitive exact pattern match is good enough for us. I can't think of much log-file scanning where I wouldn't need at least case insensitivity (which could be hacked into BM), or even a full-blown regexp... but let's assume case-sensitive exact match is good enough.

Next assumption is that we can either read the full log file into memory, or that our system has a fairly fast memory-mapped-file implementation (and that the log file isn't so sillily large we can't map it all at once - 1-2GB logs?).

Now, finding + extracting a line from a log file with BM should be easy. Use BM to find a match, then scan back&forward from there to find line delimiters - presto, you have a full line for display. For log files with time/date stamping in the log, this should suffice, and it ought to be faster than the overhead of reading the file line-by-line and using a standard byte-scanner FindStringInString kind of routine.

But what if we wanted the line number as well? The first option that sprung to mind was mixing the code to do "GetLineFromFile" with a normal byte-scanner FindStringInString thingy. I haven't put much thought into implementing such a hybrid, but I guess it could be a bit faster than first calling a GetLineFromFile and then FindStringInString... each char would only be checked once, instead of twice (first check/scan for LF to find end-of-line, then again in the FindStringInString).

If we "need" to work line-by-line, Boyer-Moore might be too slow to use... then again, there are possible optimizations. Like, you don't have to create the skiptable for each line - use a custom BM implementation. Or perhaps the GetLineFromFile logic could be implemented inside the BM - anything goes, as long as it works and is fast ^_^.

However, what about this following approach? First, use BM to locate the pattern in the "bigbuffer". Now, to get the line the hit was on, use MMX/SSE2 to count all the LF's from start of the "bigbuffer" to the found match. This still requires reading each and every byte from the start of the file to the found match, but this can be done *really* fast with MMX/SSE2, and doesn't involve a lot of compares & conditional jumps in the inner-loop. Even some non-mmx "bit trickery" code could do this pretty decently.

Some of the algorithms in the following thread could probably be rewritten to scan for LF instead of zero:
http://www.asmcommunity.net/board/index.php?topic=12089

I can't come up with a good example where I would need to scan a *huge* ASCII log-file for a exact match, and at the same time needing line numbers - it's hutch who brought that example up as an example of me being ignorant. But he brought up the issue, so I guess there's a need for something like this :)

Any thoughts on even smarter ways to implement such a search, than the BM+countLF? Or would you guys use a GetLineFromFile+FindStringInString kind of approach? Initially, ideas+brainstorming is wanted more than code, but we can always do implementations + benchmark if necessary.
Posted on 2004-02-21 15:54:58 by f0dder
Well, a logging facility designed to both manage huge log files and have line number support suggests an index file. This is a rather simple and effective approach for such problems.

Assuming that we have no index file, we could build one if we assume multiple searches on the same unchanged data source.

The method you suggested is actually one for counting lines until a certain position in the stream. This could be a fast method indeed (good work) and can be also used to construct the index file.

And about the searching itself, doing it line by line is a waste... the routine to get a line has to search for the end of line token, which just adds overhead.

Using memory-mapped-file mechanism is a good idea, especially if it's optimized towards sequential access to the file (for example in Win32 you could supply CreateFile() with the FILE_FLAG_SEQUENTIAL_SCAN hint, I'd like to think it also helps the MMF implementation).
Posted on 2004-02-21 16:51:16 by death
Heya, death.

I don't think it would take us long to agree that a monolithic text file is rather retarded for logging, but that's not for discussion in this thread - the idea about building an index table could be nice if we can assume multiple searches on *unchanged* data, but I guess the assumption in this example would be huge monolithic textfiles in a 'live' system.


The method you suggested is actually one for counting lines until a certain position in the stream.

Yep - but combined with the boyer-moore to locate the pattern...


This could be a fast method indeed (good work) and can be also used to construct the index file.

Hummm, how? You'd have to turn it into a "fast find first LF" instead of "count amount of LF's" if you want to make a line index file, wouldn't you?


And about the searching itself, doing it line by line is a waste... the routine to get a line has to search for the end of line token, which just adds overhead.

My thought too.

As for memory-mapped files, they're slower than reading in the full file to memory, because of the pagefaults associated with MMF. In very-huge-file situations, using MMFs would mean windows can discard the filecache, where a read-into-huge-buffer might mean paging. I don't really know how large these huge logs would be, but for ASCII text we could probably assume "small enough to fit in a memory buffer on a decent system". FILE_FLAG_SEQUENTIAL_SCAN would definitely be used, even though I don't know if it changes the cache scheme for MMF.

Anyway, the question to concentrate on still remains: optimal way of finding exact pattern + linenumber in a huge text file.
Posted on 2004-02-21 17:25:28 by f0dder

Hummm, how? You'd have to turn it into a "fast find first LF" instead of "count amount of LF's" if you want to make a line index file, wouldn't you?

Yes, I missed that. But couldn't the extensions be used to find end of line characters as well?


As for memory-mapped files, they're slower than reading in the full file to memory, because of the pagefaults associated with MMF.

Well, assuming FILE_FLAG_SEQUENTIAL_SCAN was used, maybe there won't be any page faults at all? (and maybe I'm too optimistic?)
I guess this needs to be further researched.


Anyway, the question to concentrate on still remains: optimal way of finding exact pattern + linenumber in a huge text file.

/me waits for more ideas :)
Posted on 2004-02-21 18:15:30 by death

Yes, I missed that. But couldn't the extensions be used to find end of line characters as well?

Sure, you could probably use MMX/SSE2 for that as well, depending on line length it might be better to use some dword scanning with bit trickery...


Well, assuming FILE_FLAG_SEQUENTIAL_SCAN was used, maybe there won't be any page faults at all? (and maybe I'm too optimistic?)
I guess this needs to be further researched.

I think you're too optimistic - this would require the entire file to be mapped in at the call to CreateFileMapping :). It might change the cache strategy to map in larger amounts for each pagefault though, and perhaps to discard memory faster in lowmem situations - probably worth playing around with (or perhaps read further in inside win2k ;)).

At least #PF isn't issued per-page, that would KILL performance. I think the default is around 64kb, but dunno if this is a fixed limit or whether the amount to page in is determined by a bunch of factors.

Anyway, the use of BM to find the hit, scan back+forw to form a full line, and then some MMX/SSE2 code to scan LF's up to the found instance is the best approach I could think of yet.
Posted on 2004-02-21 18:35:43 by f0dder


I think you're too optimistic - this would require the entire file to be mapped in at the call to CreateFileMapping :).

No, what I mean is that by specifying the sequential access hint, the caching mechanism can do something like load next page before it's actually accessed, if it detects close accesses to it. But maybe I am totally wrong. Needs experimentation!
Posted on 2004-02-21 18:51:21 by death
Anytime someone says large file to me I tell them to use a model for the data, but we aren't using this as a solution for some lame reason -- let us assume we are migrating some legacy data that got out of hand before management decided to upgrade. A sad reality in the real world. :)

Finding an Initial Canidate:

Assuming the data is in memory and not in the cache, the search algorithm should be memory bound on modern processors except when the pattern being looked for is sufficiently large - which will result in faster performance. Sufficiently large means the largest skip distance is greater than the cacheline size. Many simple algorithms meet the requirements for patterns that are not sufficiently large (i.e. no need to optimize beyond ensuring greatest initial skip).

Confirming Canidate:

The smaller the alphabet used the greater the importance of optimizing this part of the algorithm to quickly reject pseudo-matches. Generating a FSM based on the pattern is the best approach, modeling the expected input stream and search pattern. Many ideas can be incorporated into the FSM to maximize skip distance. For example, testing duplicate characters first to quickly reject false forward skips on non-match.

Searching...:

First an initial canidate is found, then the code attempts to confirm the canidate. When no useful information is known about the current possition the algortihm loops back to finding an initial canidate.
Posted on 2004-02-21 20:09:01 by bitRAKE
Case A
We have two threads and one big shared buffer
One thread for BM and one for scanning for CR(see bitRAKE's algo)
BM has two parts: fast part -> which jump forward in the buffer
and one slow part which compare the rest of the bytes (from the n-1 byte
until the 1st byte) with the pattern for matching
When the fast part finds out the last matching symbol of the pattern the algo continues
with the slow part (comparing). If the bytes match we exit with the buffer offset in eax else the algo
increment current buffer address with 1 and continue with fast part until the end of the buffer.
When the BM algo exit with match we will stop temporary 2nd thread, check out it's current buffer
offset and restart it with new end of the buffer equal of the buffer offset in eax of the BM algo.
If the 2nd thread finished before the 1st we'll need to return back from the end of the buffer until the buffer offset in eax of the BM algo.

Case B
Using my InString algo looking for the 1st byte of the pattern and CR SIMULTANEOUSLY in the larger string
http://www.asmcommunity.net/board/showthread.php?threadid=13638&

Regards,
Lingo
Posted on 2004-02-21 20:24:45 by lingo12
Afternoon, All.

For a practical use for such an algo, we must assume that multiple services and possibly applications would be running on the users system whenever this BMandLineCount proc is used.

Therefore, as lingo12 suggested, it would be mandatory for the proc to include the use of threads for the searches. Setting the thread priority (SetThreadPriority) to THREAD_PRIORITY_HIGHEST is also a must.
~~~~~~~~~~~~~~~~~~~~~~~~~~~
As we seem to be using this proc only for searching log files, it can be assumed that the pattern to search for is quite small ( < 2028 characters). If this is the case, then we could possibly speed up the counting of lines by creating a separate thread for each character in the pattern.
i.e. If the pattern is "MyPattern", then there would be 9 LineCount threads; each thread with a different start-offset:
1st thread 0 offset ('M')
2nd thread 1 offset ('y')
3rd thread 2 offset ('P')
etc.
Each LineCount thread would shift by the patterns length (i.e. 9).
So each thread would end up testing for LF only every ( 1/ patternlength).
Once the pattern is found by the BM thread, the LineCount threads would either continue counting forward (if they're running behind the BM thread) or run backwards (if they got ahead of the BM thread).
Once the LineCount threads are forward/back to where the BM thread is, then each LF count from each LineCount thread is added together, thus supplying us with the actual line count of the pattern match.

Someone's now going to say that setting and running > 5 threads at high priority at the same time is a waste of time :).

Cheers,
Scronty
Posted on 2004-02-22 04:06:52 by Scronty


If we do not have to count lines, and the pattern to search for is longer than a couple of symbols, then the possibly sub-linear algorithms (like BM) should clearly give the best performance when doing a single on-line search (i.e. there is nothing to gain from building any kind of index structure).

If we also need to know the line number where the match occured (and it is again a single on-line search) we need to look at all symbols to be able to count the number of lines, which means linear running time.

I would imagine doing a single pass over the data will be faster than doing multiple passed (this may of course be wrong on the ever-changing hardware these days), which would imply that a DFA method might be suitable (because it could count the number of lines and look for matches, while only looking at each character once).

On the other hand, there are cache issues that could affect such, which I am sure I will now see lengthy lectures about because I dared to speak up :rolleyes:



On a side note .. is it just me, or are most discussions lately turning into large-scale fights over who has spent the most time in the loo reading the Intel manuals? .. I think some people need to wake up and smell the 0xC0FFEE :grin:
Posted on 2004-02-22 06:14:49 by Jibz
Jibz, 0xC0FFEE in one hand and Intel manual in the other - I think you found the problem! Or, maybe it is all the 0xBADBEEF found around the globe. :grin:
Posted on 2004-02-22 11:39:47 by bitRAKE


No, what I mean is that by specifying the sequential access hint, the caching mechanism can do something like load next page before it's actually accessed, if it detects close accesses to it. But maybe I am totally wrong. Needs experimentation!

Oopsie, access to MMF pages that are already loaded is uncontrolled...
Posted on 2004-02-23 10:48:01 by death