Hello guys,

I know I haven't been a long time on this board but since I was working on some algorithms I had just not enough time to do both. Now again there are some problems with my search engine where I am working on. I hope you smart guys have some good idea's!! :grin:

Well the problem is as follows:

- When I have a text like some chapter from a book and I want to search for any word my search engine just goes blasting away comparing word for word until the end of text and it works lineair. This is a search engine in the simplest form (the text is btw compressed, so actually I'm comparing some hex codes). Now an advanced search engine needs more advanced functionality like searching for exact phrases or more keywords combined with boolean logic like 'OR' / 'AND'. I hope you can follow..

Well the first extra functionality, for the exact phrase part, seems to be solved through a BoyerMoore algorithm and will run even sublineair.

That I understand, but the second one is a bit more difficult. I just don't know a correct algorithm for this but if I think about it then looking for search queries like "water <AND> sun" then I just can't think of a better method than running 2 comparisons, and doing so making the algo 2 times lineair. And even worse if someone wants to search for like many words combined with booleans than it will slow down the engine dramatically I think (you will have to compare with a whole array for every word). On the other hand assembly is fast ans maybe it won't be noticed if the queries are not that combined.

I hope there is some better way, because when I 'google' on the internet, the queries are in fact also boolean combined keywords and not exact matching patterns, if you know what I mean. But the speed of 'google' seems to be non decreasing.

People I know this question is a bit dropping like a bomb from outer space but if someone ever worked with searching algorithms he must know what I'm talking about. Anykind of help would be appreciated.

Thank You in advance,
Posted on 2003-09-13 05:00:48 by eisodur
yeah google is amazing...i heard one of its engineers actually worked for the NSA..maybe thats why its so damn good :grin:

anyway, i've never done anything like this, but one approach that mite work is to use the data input from the user (for instance 'cars AND sellers') and so on to construct a regex of sum sort and then apply that...regexes can be used on byte values as well, they dont have to be plain text...but i dont know abt the speed aspect
Posted on 2003-09-13 14:23:58 by AnotherWay83
You could try sorting the search terms and use a binary search tree or something of that nature. Then you wouldn't have to compare with all terms.
Posted on 2003-09-13 14:32:55 by _js_
thnx -js-

thats quite an idea!!

You mean to sort the query words and then do for every word in the text for example a short binary search on the query, that's quite a nice thought thnx!!:)
Posted on 2003-09-15 03:23:52 by eisodur
the problem with that is that you will be looking for upper and lower case and if you convert the word there searching for and the on one the webpage...

to solve this you have to either convert the words there searching for in either caps or lower case and same with whats on the sites then converting to binary... otherwise it will not work the way you want... converting might add on more time then you want... not sure..
Posted on 2003-09-15 03:32:41 by devilsclaw
well a fast way, that does require more storage, is to put each single 'entity' (word in this case) on its own with references to it (for this forum that would be posts for instance) and do a search on that.


sentence 1
"A monkey eats a banana."

sentence 2
"look at that monkey eating a banana!"

since punctuation usually is not relevant to the search results the easy way out is to get rid of it all together. Also words below a certain size like one or 2 letter words usually are either not added or they're compared to a "do not add" list so that your data is not cluttered with commonalities like "at" "and" "or" "that" "this" (for this example we nullify all such words)

monkey | 1,2
eats | 1
banana | 1,2

look | 2
eating | 2

a search for monkey would retrieve sentence 1 and sentence 2 (both with the same relevance)
a search for monkey AND eat would retrieve (after shifting) sentence 1 (since it's the only shared one)
a search for monkey OR eat would retrieve sentence 1 with bigger relevance than sentence 2 (sentence 1 is common data for all keywords while 2 is not)

see what I mean? :)
Posted on 2003-09-15 03:49:24 by Hiroshimator
A super-serious solution is to implement a weighted neural network whose nodes represent the compressed text data and whose weighted links identify common associations of terms. This trainable network would "get smarter" and yet its search times would remain relatively linear provided that the network links are relatively linear too.
Think about this - a neural network is just an N-linked tree (the database), but because of the weights placed on the links, it can be used to solve VERY complex problems - and grammatical syntax is something it's well suited toward.

Five Dollar !! Five Dollar !!
Posted on 2003-09-15 10:41:42 by Homer