What would be the best (fastest) way to search for duplicate files?
I have a recursive function for file search, but I am trying to implement a search for duplicate files.
I will consider two files duplicates if their size, file name and FILETIME are the same.
But I am not sure how and where to store all that information for every single file; if, for example, I need to search the whole hard disk, that would be a lot of information.
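
Just to show what I mean, something like this is the per-file record I'm picturing - only a sketch, the struct and names aren't from any real code:

#include <windows.h>

/* One record per file found by the recursive scan.  Struct and
   field names are made up just to illustrate the idea. */
typedef struct FILEREC {
    char     szName[MAX_PATH];   /* file name without the path          */
    DWORD    dwSizeLow;          /* size as reported in WIN32_FIND_DATA */
    DWORD    dwSizeHigh;
    FILETIME ftLastWrite;        /* last-write FILETIME                 */
} FILEREC;

/* Two records count as duplicates when all three criteria match. */
BOOL IsSameFile(const FILEREC *a, const FILEREC *b)
{
    return a->dwSizeLow  == b->dwSizeLow
        && a->dwSizeHigh == b->dwSizeHigh
        && CompareFileTime(&a->ftLastWrite, &b->ftLastWrite) == 0
        && lstrcmpiA(a->szName, b->szName) == 0;
}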

any suggestions, tips, sources?... :)
Posted on 2003-03-02 20:53:07 by Mikky
You can determine a duplicate by its:

1. filename - string comparison
2. checksum - on the filename or on the file's contents
3. file size

I can think of a couple of ways to do it.

My idea of a 1-pass algo would be to put all the checksum values, filenames, etc. in a linked list on the first pass, then sort the list, then check all adjacent checksum values, filenames, etc. for duplicates. The downside of this is that it consumes a huge amount of memory. Maybe you can create temporary files for every 1000 entries or so and keep track of which files went into which, then do the sorting on those - but that would be tedious. (Bad idea: memory hog.)
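
Roughly what I mean, as a sketch only - an array plus qsort() stands in here for the linked list, and the checksum field is assumed to have been filled in during the first pass by whatever routine you use:

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned long crc;    /* checksum, filled during the first pass */
    unsigned long size;   /* file size in bytes                     */
    char path[260];       /* full path of the file                  */
} ENTRY;

/* Order by checksum, then by size, so equal files end up adjacent. */
static int CompareEntry(const void *a, const void *b)
{
    const ENTRY *x = (const ENTRY *)a, *y = (const ENTRY *)b;
    if (x->crc  != y->crc)  return (x->crc  < y->crc)  ? -1 : 1;
    if (x->size != y->size) return (x->size < y->size) ? -1 : 1;
    return 0;
}

/* Second stage: sort, then report any adjacent equal entries. */
void ReportDuplicates(ENTRY *entries, size_t count)
{
    size_t i;
    qsort(entries, count, sizeof(ENTRY), CompareEntry);
    for (i = 1; i < count; i++)
        if (CompareEntry(&entries[i - 1], &entries[i]) == 0)
            printf("possible duplicate: %s <-> %s\n",
                   entries[i - 1].path, entries[i].path);
}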

Another solution is to make use of a hash table. This would require 2 (at least) to 3 (most probably) passes, depending on how you want to deal with the problem.
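
Something like this is the static-size table I have in mind - just a sketch with made-up names, using chaining, where a match found at insert time marks a candidate duplicate:

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536                /* fixed table size */

typedef struct NODE {
    unsigned long crc;                /* checksum of the file */
    char path[260];                   /* where we saw it      */
    struct NODE *next;                /* chain for collisions */
} NODE;

static NODE *g_table[NBUCKETS];

/* Insert a (checksum, path) pair.  Returns the path of a file that
   was already stored with the same checksum, or NULL if it's new. */
const char *InsertOrFind(unsigned long crc, const char *path)
{
    unsigned idx = (unsigned)(crc % NBUCKETS);
    NODE *n;

    for (n = g_table[idx]; n != NULL; n = n->next)
        if (n->crc == crc)
            return n->path;           /* candidate duplicate */

    n = (NODE *)malloc(sizeof(NODE));
    if (n == NULL) return NULL;
    n->crc = crc;
    strncpy(n->path, path, sizeof(n->path) - 1);
    n->path[sizeof(n->path) - 1] = '\0';
    n->next = g_table[idx];
    g_table[idx] = n;
    return NULL;
}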

Just remember that the space (memory) and time constraints on this type of problem are at opposing ends. You have to sacrifice space to gain speed, or the other way around.
Posted on 2003-03-02 23:41:10 by arkane
Hi Arkane,

You could save a lot of space by storing only the PIDL and checksum in a linked list; then, when you get two or more equal checksums, you would look up the full info and check it. It would slow things down but conserve an incredible amount of memory.
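
Along these lines, as a rough sketch only (a plain index stands in where the PIDL would go, the names are made up, and the final check here just re-reads both files):

#include <stdio.h>

/* Small per-file record: only a checksum plus a handle (PIDL or an
   index into a path table) used to look the file up again later. */
typedef struct {
    unsigned long crc;
    unsigned long id;     /* stands in for the PIDL */
} SLIMENTRY;

/* Called only when two SLIMENTRY.crc values turn out to be equal:
   re-read both files and compare them byte for byte.  (You could
   just as well re-check name/size/FILETIME here instead.) */
int VerifyDuplicate(const char *pathA, const char *pathB)
{
    FILE *a = fopen(pathA, "rb");
    FILE *b = fopen(pathB, "rb");
    int ca, cb, same = (a != NULL && b != NULL);

    if (same) {
        do {
            ca = fgetc(a);
            cb = fgetc(b);
        } while (ca == cb && ca != EOF);
        same = (ca == cb);
    }
    if (a) fclose(a);
    if (b) fclose(b);
    return same;
}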

Donkey
Posted on 2003-03-02 23:52:34 by donkey
Yep, a linked list will conserve memory, but considering that some hard disks hold a really huge number of files, I don't know... IMO I'm with hashing (a hash table). It consumes less memory than a linked list because you don't need another 4 bytes of memory per file entry for the next pointer. But that also depends on the hash table size. If the table is static in size, you probably need only 1 pass; it's less flexible, but faster than the linked list. If it's dynamic in size, probably 2 passes, and I don't know about the speed. Checking whether a file is a duplicate (the 2nd pass when hashing) is fast with the table compared to a linked list, but I think the 2 passes of the hashing technique will affect the overall performance.
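
Just to make the "4 bytes per entry" point concrete - illustrative field sizes, assuming 32-bit pointers, and the pointer-free layout is specifically an open-addressed (probing) table:

#include <stdio.h>

struct LIST_NODE {            /* linked-list entry                  */
    unsigned long crc;        /* 4 bytes                            */
    unsigned long fileId;     /* 4 bytes                            */
    struct LIST_NODE *next;   /* 4 more bytes for every file        */
};

struct TABLE_SLOT {           /* slot in an open-addressed table:   */
    unsigned long crc;        /* probing instead of chaining, so no */
    unsigned long fileId;     /* pointer is needed                  */
};

int main(void)
{
    printf("list node : %u bytes\n", (unsigned)sizeof(struct LIST_NODE));
    printf("table slot: %u bytes\n", (unsigned)sizeof(struct TABLE_SLOT));
    return 0;
}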

Decisions, decisions... let the author try out the 2 and decide for himself. ;)

Of course, this also depends on the type of hardware, but I'm not going to go into that... :grin:

Time to code and benchmark (of course, it's not going to be me!!)... :grin:
Posted on 2003-03-03 00:21:59 by arkane
You could also maybe limit the types of files you look for? (I'm pretty sure you do not wish to go messing with system files and DLLs.)

There is code for various hashes in the FAQ forum if you need it.

You could use a database to store the hashes; they're easily searchable, and you don't consume much memory that way.
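
For example, with SQLite standing in for "a database" (just an illustration - the schema, file name and query are mine, not anything that already exists):

#include <stdio.h>
#include <sqlite3.h>

/* Print one pair of matching paths per row of the duplicate query. */
static int print_pair(void *unused, int ncols, char **vals, char **names)
{
    (void)unused; (void)ncols; (void)names;
    printf("duplicate: %s <-> %s\n", vals[0], vals[1]);
    return 0;
}

int main(void)
{
    sqlite3 *db;
    char *err = NULL;

    if (sqlite3_open("filescan.db", &db) != SQLITE_OK)
        return 1;

    /* One row per scanned file; the index makes the duplicate
       lookup a plain query instead of an in-memory structure. */
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS files("
        "  path TEXT PRIMARY KEY, size INTEGER, hash INTEGER);"
        "CREATE INDEX IF NOT EXISTS idx_size_hash ON files(size, hash);",
        NULL, NULL, &err);

    /* Rows get inserted during the directory scan (not shown).
       Afterwards, duplicates are just the (size, hash) pairs that
       occur more than once. */
    sqlite3_exec(db,
        "SELECT a.path, b.path FROM files a JOIN files b"
        " ON a.size = b.size AND a.hash = b.hash AND a.path < b.path;",
        print_pair, NULL, &err);

    sqlite3_free(err);
    sqlite3_close(db);
    return 0;
}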
Posted on 2003-03-03 03:29:16 by Hiroshimator
First, scan all files. Then keep only the files whose size matches at least one other file's. Then checksum those files and determine duplicates by checksum. There is no need to checksum every file, only the ones with matching sizes.
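
As a rough sketch of that (the array is assumed to be already filled by the directory scan, and the checksum routine here is only a trivial placeholder):

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned long size;     /* from the directory scan           */
    unsigned long crc;      /* computed lazily, only when needed */
    char path[260];
} FILEINFO;

/* Trivial placeholder checksum - use whatever real hash you like. */
static unsigned long Checksum(const char *path)
{
    unsigned long sum = 0;
    int c;
    FILE *f = fopen(path, "rb");
    if (f == NULL) return 0;
    while ((c = fgetc(f)) != EOF)
        sum = sum * 31 + (unsigned char)c;
    fclose(f);
    return sum;
}

static int BySize(const void *a, const void *b)
{
    unsigned long x = ((const FILEINFO *)a)->size;
    unsigned long y = ((const FILEINFO *)b)->size;
    return (x < y) ? -1 : (x > y) ? 1 : 0;
}

void FindDuplicates(FILEINFO *files, size_t count)
{
    size_t i, j, k, runEnd;

    qsort(files, count, sizeof(FILEINFO), BySize);
    for (i = 0; i < count; i = runEnd) {
        /* find the run of files that share this size */
        for (runEnd = i + 1; runEnd < count; runEnd++)
            if (files[runEnd].size != files[i].size)
                break;
        if (runEnd - i < 2)
            continue;                       /* unique size: no checksum */
        for (j = i; j < runEnd; j++)        /* checksum only this run   */
            files[j].crc = Checksum(files[j].path);
        for (j = i; j < runEnd; j++)        /* compare within the run   */
            for (k = j + 1; k < runEnd; k++)
                if (files[j].crc == files[k].crc)
                    printf("duplicate? %s <-> %s\n",
                           files[j].path, files[k].path);
    }
}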
Posted on 2003-03-03 16:37:55 by comrade