Hi, all

Can I use the CRC32 or Adler algorithms to get 2 dword checksums for 2 strings:

str1 db "abdef",0
str2 db "Abdeg",0

and then compare checksum1 and checksum2 to find which string is "bigger" or "smaller"?
I don't want to compare the strings with string compare instructions such as CMPSB...
Any ideas?

Thanks in advance
Posted on 2002-09-11 10:32:47 by lingo12
those algorithms will not tell you if one of those strings is 'bigger' or 'smaller'. in fact, two different strings might produce the same checksum.
Posted on 2002-09-11 10:53:18 by Tola
if you really don't want to do this yourself (why?) then try to
use the API: lstrcmpi
Posted on 2002-09-11 10:55:55 by mob
Hi Tola,
"in fact, two different strings might produce the same checksum."
If that is true, what is the purpose of these algos?

"those algorithms will not tell you if one of those strings is 'bigger' or 'smaller'"
Why do you think so? Are you familiar with the relation between
the length of the string, the weight of its members, and the checksum?

Any other ideas?
Posted on 2002-09-11 12:43:04 by lingo12

"in fact, two different strings might produce the same checksum."


This is true, of course, because there are infinitely many possible strings but only a finite number of different CRCs. Hence, there must be duplication.


If that is true, what is the purpose of these algos?

They are for error checking. If I'm not mistaken, the CRC algorithm is designed to (almost) guarantee different CRCs for slight variations in data, so if your data gets corrupted in transmission you can detect this.

I think what everyone is getting at is that a string length computation is not going to be terribly expensive unless the strings are quite large... There are many good algorithms for string length on the board for this.

--Chorus
Posted on 2002-09-11 13:25:56 by chorus
Hi Chorus,

"They are for error checking. If I'm not mistaken, the CRC algorithm is designed to (almost) guarantee different CRCs for slight variations in data."
lol, what does "slight variations in data" mean -> maybe 2 different strings?

"This is true, of course, because you can have really an infinite number of strings and only a finite number of different CRCs. Hence, there must be duplication."

Theoretically yes, in practice no.


"There are many good algorithms for string length on the board for this."

For general use, yes, but in my case they are slow,
so I prefer to compare 2 DWORDs 10000 times rather than 30 DWORDs 10000 times.


Any other ideas? May be Thomas or Vecna?
Posted on 2002-09-11 16:12:29 by lingo12
A checksum would work when you are testing if string1 != string2, but if the checksums match you are still going to have to do a final Strcmp to validate the match. Despite the low probability of error, if you leave out the Strcmp validation then, well, it's just bad programming.
Posted on 2002-09-11 16:37:53 by iblis

"those algorithms will not tell you if one of those strings is 'bigger' or 'smaller'"
Why do you think so? Are you familiar with the relation between
length of the string, the weight of the members and the checksum?


In some rare cases, you *might* be able to determine the difference ('bigger'/'smaller') between two strings by comparing their CRCs, but there's no general method. CRCs are simply too complex to be useful for string comparison (other than testing for equality).
If you want to test a big list of strings for duplicates, it's definitely more efficient to calculate a hash code for all the strings first (CRC, Adler or just a simple string hash function) and compare those hash values. If the hash values match, there's a good chance the strings are equal, although you can't be sure (as chorus pointed out), so you'll have to double-check then. But it saves you from comparing a lot of strings.
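Thomas's hash-first, verify-second scheme can be sketched like this (Python here purely for illustration on an asm board; the function name `find_duplicates` and the test strings are made up, and `zlib.crc32` stands in for whatever CRC routine you use):

```python
import zlib

def find_duplicates(strings):
    """Bucket strings by CRC32 first; only strings whose CRCs match
    are compared directly, exactly as Thomas suggests."""
    by_crc = {}
    duplicates = []
    for s in strings:
        crc = zlib.crc32(s.encode())
        bucket = by_crc.setdefault(crc, [])
        # A CRC match is only a hint: verify with a real comparison.
        if any(t == s for t in bucket):
            duplicates.append(s)
        else:
            bucket.append(s)
    return duplicates
```

Most pairs of strings never get compared byte-by-byte, because their CRCs already differ; the full comparison only runs inside a matching bucket.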

Thomas
Posted on 2002-09-11 16:48:03 by Thomas

lol, what mean "slight variations in data." -> maybe 2 different strings


Ex. string one: "Hello"
string two: "Hfklo"

if you did a simple checksum (adding up the bytes) you'd get the same checksum for both strings because f = e+1 and k = l-1, so the net effect is null.
CRC is designed to avoid situations like this where only a little has gone wrong: the two strings will get different CRCs and hence you will not overlook the corruption.
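The "Hello"/"Hfklo" example checks out, and it is easy to verify (Python for illustration; `byte_sum` is a made-up name for the naive add-the-bytes checksum):

```python
import zlib

def byte_sum(s):
    # Naive checksum: just add up the byte values.
    return sum(s.encode()) & 0xFFFFFFFF

# 'f' = 'e' + 1 and 'k' = 'l' - 1, so the changes cancel in a sum...
assert byte_sum("Hello") == byte_sum("Hfklo")
# ...but CRC32 still tells the two strings apart.
assert zlib.crc32(b"Hello") != zlib.crc32(b"Hfklo")
```

The two changed bytes form a short burst, and CRC32 is guaranteed to detect any error burst shorter than 32 bits, which is why the CRCs must differ here.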


Theoretically yes, in practice no.


It's pretty simple: a CRC is 4 bytes. A string is (usually) more. If you can map all strings to unique CRCs then you've stumbled upon the best compression algorithm in the world. Is it likely that two strings will have the same CRC? No. But it could happen and if you don't account for the possibility in your program then I'd say that's poor programming style.


For general use yes but in my case they are slow,
so I prefer to compare 10000 times 2 DWORDS rather than 10000 times 30 DWORDS.


Calculating the length of the string is going to be computationally far cheaper than the CRC32, unless the strings are fixed and you don't have to recalculate the CRC. In that case, the length is fixed too.

--Chorus
Posted on 2002-09-11 16:56:52 by chorus
Hi iblis,
"but if the checksums match you are still going to have to do a final Strcmp to validate the match"
That is impossible when you transfer data between 2 computers, so
people use protocols and CRCs, and that is not bad programming.
Maybe I'm looking for a solution in the wrong forum?
Posted on 2002-09-11 17:03:26 by lingo12
Thank you Thomas for the professional answer,

"In some rare cases, you *might* be able to determine the difference ('bigger'/'smaller') between two strings by comparing their CRCs, but there's no general method.
CRCs are simply too complex to be useful for string comparison (other than testing for equality)."

CRCs are not too complex for their creators, and maybe they
have additional publications, but anyway thanks again.
Posted on 2002-09-11 17:19:44 by lingo12
Mob already gave the answer...

Too much fuss over nothing (ala Aquila)... :grin:
invoke lstrcmp, OFFSET str1, OFFSET str2

test eax, eax
jz __equal
js __str1_is_less_than_str2
;code here for str1 is greater than str2
Quoting from PSDK
If the string pointed to by lpString1 is less than the string pointed to by lpString2, the return value is negative. If the string pointed to by lpString1 is greater than the string pointed to by lpString2, the return value is positive. If the strings are equal, the return value is zero.
And if someone still objects to this method, tell me how hard it is to implement one... :grin:

another solution is to create another hasher... probably a new variation. lemme think first. :)
Posted on 2002-09-11 17:50:29 by stryker

"In some rare cases, you *might* be able to determine the difference ('bigger'/'smaller') between two strings by comparing their CRCs, but there's no general method.
CRCs are simply too complex to be useful for string comparison (other than testing for equality)."

CRCs are not too complex for their creators, and maybe they
have additional publications, but anyway thanks again.
CRCs are not very complex. They are the result of Boolean "division" using XOR (instead of subtract), the CRC is a Boolean "remainder". Even with ordinary division, you can't use remainders to determine which order the original dividends are in.

If you're always comparing 15 pairs of DWORDs then either you have a lot of strings that start with the same 14 DWORDs (56 bytes), or you have a very poor algorithm for comparing strings. Most people write string compares that quit comparing on the first difference. And, yes, REPE CMPSB will stop comparing on the first pair of bytes that differ.
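tenkey's remainder argument can be demonstrated with ordinary integer division, and with CRC32 itself (Python for illustration; the two strings "a" and "abc" are just a convenient counterexample):

```python
import zlib

# Ordinary remainders lose the order of the dividends:
assert 9 < 11
assert (9 % 5) > (11 % 5)        # remainders are 4 and 1

# CRC32 is also a remainder (of Boolean/XOR polynomial division),
# and it loses ordering just the same:
assert b"a" < b"abc"                          # lexicographic order...
assert zlib.crc32(b"a") > zlib.crc32(b"abc")  # ...but CRC order is inverted
```

So even a perfect, collision-free hash of the strings could not tell you which string is "bigger"; ordering information simply isn't preserved.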
Posted on 2002-09-11 21:22:12 by tenkey

That is impossible when you transfer data between 2 computers, so
people use protocols and CRCs, and that is not bad programming.


You are comparing apples to oranges.

There is nothing that can be done from an application standpoint to prevent data transmission errors. Applications that receive data over a network do not create data errors - all that they can do is check for them. The errors are not the programmers' fault.

In your case, for string comparison, if you do not Strcmp() after you compare CRCs then YOU are creating a possible error. While the probability is small that two unmatching strings will have the same CRC, if an error does indeed occur, it will be YOUR fault. That is bad programming.

The checksum method would be fast for checking inequality, but if you want to be 100% certain that two strings match you are going to have to eventually use a normal string comparison algorithm. You can debate this point until you are blue in the face but you will find it is a losing battle.

Good luck ;)
Posted on 2002-09-12 06:31:11 by iblis
Thank you guys for your time.

tenkey,
you are very close to my problem.

"They are the result of Boolean "division" using XOR (instead of subtract), the CRC is a Boolean "remainder". Even with ordinary division, you can't use remainders to determine which order the original dividends are in."

My idea is to modify a similar CRC algo and store information about both
the "remainder" and the "quotient", and compare only 2 DWORDs.

The lengths of my strings (they represent book titles) are less than 250 bytes.
The volume is between 150,000 and 2,000,000 titles, and I just want to improve
my binary tree Add_New_Node algo.
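For what it's worth, a hash *can* drive the tree if only duplicate detection matters and alphabetical ordering of the tree does not. Here is a sketch (Python for illustration; `Node`, `add_node`, `build` and `count_nodes` are hypothetical names, and `zlib.crc32` stands in for the modified CRC): order nodes by the cached dword hash, and fall back to a full string compare only on equal hashes.

```python
import zlib

class Node:
    def __init__(self, title):
        self.title = title
        self.crc = zlib.crc32(title.encode())  # cached dword key
        self.left = self.right = None

def add_node(root, title):
    """Insert a title, ordering nodes by CRC instead of by string.
    Most probes cost one dword compare; only an equal CRC forces a
    full string comparison. Note: the tree is NOT alphabetical."""
    crc = zlib.crc32(title.encode())
    if root is None:
        return Node(title)
    if crc < root.crc:
        root.left = add_node(root.left, title)
    elif crc > root.crc:
        root.right = add_node(root.right, title)
    elif title != root.title:          # rare CRC collision: verify
        root.right = add_node(root.right, title)
    # equal CRC and equal string: duplicate, do nothing
    return root

def build(titles):
    root = None
    for t in titles:
        root = add_node(root, t)
    return root

def count_nodes(n):
    return 0 if n is None else 1 + count_nodes(n.left) + count_nodes(n.right)
```

The collision branch is what iblis and chorus insisted on: without that final string compare, two different titles with the same hash would silently be treated as duplicates.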

Regards
Posted on 2002-09-12 08:45:04 by lingo12
lingo12,
From your last post here's what I understand you are trying to do:

1) You have a database of strings (book titles)
2) You want to insert a new string
3) The string comparison is case-insensitive

Here are my suggestions:

1) Sort the database using a good sorting algo like HeapSort or QuickSort.
2) Store the sorted database
3) Once the database is sorted (or if it is already sorted), then adding a new string is not going to cost you thousands of comparisons. At worst (with 20,000,000 records) it'll cost you <25 comparisons by doing a binary search of the database. Last time I checked, the first 8 characters of book titles are usually not the same, so creating and storing a qword hash will not gain you anything. By the 8th character, you're bound to have some difference. Even if your comparison *is* equal up to 8 chars, then you should be happy because it means your search is almost over.
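The binary-search insert in step 3 can be sketched as follows (Python for illustration; `insert_title` is a made-up name and the sample titles are hypothetical, with `bisect_left` from the Python standard library doing the binary search):

```python
from bisect import bisect_left

def insert_title(sorted_titles, title):
    """Binary-search the sorted database for the insertion point;
    for N records this is about log2(N) comparisons. Returns the
    index and whether a new record was actually inserted."""
    i = bisect_left(sorted_titles, title)
    if i < len(sorted_titles) and sorted_titles[i] == title:
        return i, False                 # already present
    sorted_titles.insert(i, title)
    return i, True
```

Each probe of the binary search is one full string comparison, but there are only ~21 of them for 2,000,000 records, which is why hashing buys so little here.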

A couple of improvements can be made over this design:

1) Create a table with an entry for each letter of the alphabet. In each entry, store the location in the database where titles beginning with that letter start. This will narrow down the search area of the database considerably.
2) If you don't want to sort the whole database and bother moving all the records around, create an index table or a rank table and work with that. This can speed up the actual adding of a record, since you won't have to move large parts of the database to insert the string; you can just append it.
3) Write a fast string compare routine that is case-insensitive and watches for the possibility of numbers. A fast way to do this is to read a dword, OR it with 20202020h, and do a straight compare. This will map all uppercase to lowercase and leave lowercase letters and numbers intact. It will mess up punctuation though.
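The OR-with-20202020h trick in point 3 works because ASCII lowercase letters and digits already have bit 5 (20h) set, while uppercase letters differ from lowercase only in that bit. A byte-level illustration (Python; `fold_case` and `fold_dword` are made-up names for the idea):

```python
def fold_case(data: bytes) -> bytes:
    """OR every byte with 0x20: uppercase becomes lowercase,
    lowercase letters, digits and spaces are unchanged (their
    0x20 bit is already set). Punctuation may be mangled."""
    return bytes(b | 0x20 for b in data)

def fold_dword(d: int) -> int:
    # dword-at-a-time equivalent of "or it with 20202020h"
    return d | 0x20202020
```

For example, `fold_case(b"Book Title 3")` gives `b"book title 3"`, so two folded titles can be compared with a straight equality test.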

Maybe the others have some better ideas


--Chorus
Posted on 2002-09-12 09:24:47 by chorus
Chorus,
Your suggestions are already implemented in my old project,
but the idea is the possibility of substituting your "fast string compare routine"
with a "fast 2-DWORD hash compare routine" in my binary tree ADD_NODE algo.
Anyway, thank you for your efforts.

Regards
Posted on 2002-09-12 10:11:19 by lingo12