I am looking for the fastest way to compare two or more files by their contents (assuming they are all of the same size). So I will have one proc with a variable number of arguments, which will be pointers to the file paths.

Several solutions came to my mind:

1. Regular CreateFile/ReadFile to first open and then read all files into memory. Then I can do the comparing, but this way doesn't seem very optimized when it comes to very large files.
This way the files would always be completely loaded into memory, even when only the 1st byte is different - effectively a waste of CPU and memory.

2. CreateFile, then ReadFile() of, say, the first 1 MB from all files which are to be compared, then compare them, and if the contents are the same, repeat this in a loop until all contents are compared. When the first difference is encountered, stop further processing and return FALSE.
I think this is a much better way, though a little more complicated to implement.

3. Memory mapped files. Because of the way MMF are implemented, it appears to me that this is the fastest way, since the file contents will be accessed on demand, and thus when the first difference in contents is encountered I can stop reading and return FALSE. Similar to the 2nd way, only here I let Windows do the complicated work which I would otherwise have had to do myself.

4. CreateFile, then ReadFileEx() - very complicated for me since I have never done any async I/O coding, but could this be the best way? (I hope it's not, otherwise I'd have to learn all that stuff :grin: )
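For what it's worth, option 2 can be sketched in portable C; fopen/fread here stand in for CreateFile/ReadFile, and the 1 MB chunk size and function name are made up for illustration:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (1024 * 1024)   /* 1 MB per read - a guess, tune to taste */

/* Compare two files chunk by chunk, stopping at the first difference.
   Assumes the caller has already checked that the sizes match. */
bool files_equal(const char *path1, const char *path2)
{
    FILE *f1 = fopen(path1, "rb");
    FILE *f2 = fopen(path2, "rb");
    unsigned char *buf1 = malloc(CHUNK);
    unsigned char *buf2 = malloc(CHUNK);
    bool equal = (f1 && f2 && buf1 && buf2);

    while (equal) {
        size_t n1 = fread(buf1, 1, CHUNK, f1);
        size_t n2 = fread(buf2, 1, CHUNK, f2);
        if (n1 != n2 || memcmp(buf1, buf2, n1) != 0)
            equal = false;            /* first difference: stop reading */
        else if (n1 < CHUNK)
            break;                    /* short read means end of file */
    }
    free(buf1);
    free(buf2);
    if (f1) fclose(f1);
    if (f2) fclose(f2);
    return equal;
}
```

The same structure works with CreateFile/ReadFile; only the open and read calls change.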

Any ideas?
Posted on 2004-08-13 17:01:14 by Mikky
Maybe file mapping? Iczelion has a tutorial on it, you should check it out.
Posted on 2004-08-13 17:13:21 by HeXeN
Hi Mikky, it might be easier just to get the CRC of the two files; it is unlikely (though remotely possible) to have two different files with the same CRC.
Posted on 2004-08-13 17:25:02 by donkey
HeXeN: Yeah, thanks for your thought and suggestion, appreciate that, but I already know how to use MMF, and I was looking for some better-argued answers.

donkey, I considered that CRC32 way but then gave up on it. I think it's the slowest method (like the 1st idea above), since you will always have to calculate the CRC of the whole file no matter how big it is. Now consider you have 2 files of 100 MB which differ only in the first several bytes. If I start comparing them, I would realise they are different after comparing the first byte; the procedure's work is done and FALSE would be returned. But with the CRC approach, the proc would first have to calculate the CRC of both files, then compare them, and only then return FALSE.
Posted on 2004-08-13 17:25:12 by Mikky

1. Regular CreateFile/ReadFile to first open and then read all files into memory. Then I can do the comparing, but this way doesn't seem very optimized when it comes to very large files.
This way the files would always be completely loaded into memory, even when only the 1st byte is different - effectively a waste of CPU and memory.
Keep in mind that files are always loaded in blocks. The minimum block size is one disk sector, but for convenience, Windows uses the 4K page size as its basic block size.

Disk access is incredibly slow compared to memory access. So when you must look at every byte, the more data you load at one time, the faster your code executes. If you run out of RAM (not the same as memory space), you'll experience the slowdown of paging, because paging accesses the disk.
Posted on 2004-08-13 20:08:12 by tenkey
Mikky,

It basically depends on the size of the files you want to compare. If they are less than a few meg each, just load both into memory and do a byte compare on them; the code is very simple to write. If you have to compare much larger files, you basically do the same, but read the two files in preset blocks and do the comparison on each pair of blocks.

On any hardware from the last 10 years that runs 32-bit Windows, you can safely use a 1 meg buffer for the testing, and the performance will be pretty good if you do it right.
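In rough C terms (stdio standing in for the Win32 calls, names invented), the small-file case might look like:

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Small-file case: slurp both files into memory and compare in one shot.
   The size check up front rejects unequal-length files cheaply. */
bool small_files_equal(const char *path1, const char *path2)
{
    FILE *f1 = fopen(path1, "rb");
    FILE *f2 = fopen(path2, "rb");
    bool equal = false;

    if (f1 && f2) {
        fseek(f1, 0, SEEK_END);
        fseek(f2, 0, SEEK_END);
        long size1 = ftell(f1);
        long size2 = ftell(f2);
        rewind(f1);
        rewind(f2);
        if (size1 == size2 && size1 >= 0) {
            unsigned char *b1 = malloc(size1 ? size1 : 1);
            unsigned char *b2 = malloc(size1 ? size1 : 1);
            if (b1 && b2 &&
                fread(b1, 1, size1, f1) == (size_t)size1 &&
                fread(b2, 1, size2, f2) == (size_t)size2)
                equal = (memcmp(b1, b2, size1) == 0);
            free(b1);
            free(b2);
        }
    }
    if (f1) fclose(f1);
    if (f2) fclose(f2);
    return equal;
}
```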

Regards,

hutch at movsd dot com
Posted on 2004-08-13 20:18:49 by hutch--
You could use asynchronous file operations, so that you can compare blocks while reading from the files.
You mentioned it as your fourth option. I believe that would be the fastest solution.
Posted on 2004-08-13 21:17:44 by death

You could use asynchronous file operations, so that you can compare blocks while reading from the files.
You mentioned it as your fourth option. I believe that would be the fastest solution.


Only for fairly large files. Async has a performance impact in the initial setup and on the waits (more kernel mode transitions).

You could check the file size: if it is under 1 meg, read it all in one go; if not, use 1 meg chunks read in by async I/O or memory mapped files.
Posted on 2004-08-13 22:11:45 by Mecurius
Don't do memory mapped files - you won't be able to compare huge files this way. Besides, memory mapped files are a bit slower (not all that much, but every cycle counts :D ).

If you're going to deal with large files, and only need to compare for equality (i.e., you don't need a list of changed bytes), then I would probably do something like the following...

Async reads with large buffers, perhaps a megabyte or more. Compute a CRC32 or MD5 (or other secure hash) of the block while you wait for the next block to arrive. Disk I/O is slow, so if you get it going correctly you should be able to process a block before the next one arrives.

You don't have to process a full hash of each file; you can either finalize the hash of each block and compare those, or compare the intermediate results from the two files (which gives you the benefit of listing the hash of the whole file when done).
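One cheap per-block check along these lines is a plain CRC-32. This is a generic bitwise sketch (reflected polynomial 0xEDB88320, as used by zip/zlib), not code from this thread:

```c
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320, as in zip/zlib).
   Start with crc = 0 and feed each block's result back in as `crc`,
   so per-block values chain into the CRC of the whole file. */
uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    crc = ~crc;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}
```

A table-driven version is much faster; the bitwise form just keeps the sketch short.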
Posted on 2004-08-14 09:14:42 by f0dder

Don't do memory mapped files - you won't be able to compare huge files this way.


Actually I think MMF in combination with large files is a pretty good fit. AFAIK they were created primarily for this reason.
The thing is that I just need to find the first difference in the contents to return FALSE, thus it is likely that I won't need to check and load the whole file. So I think the question here has been narrowed down to which is better:
1. memory mapping files
2. async IO
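As a sketch of what option 1 buys you, here is the same on-demand idea written with POSIX mmap (the Unix counterpart of CreateFileMapping/MapViewOfFile); error handling is minimal and the function name is invented:

```c
#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map both files and let the OS page data in on demand.
   memcmp returns at the first mismatching byte, so pages well past
   the first difference are never faulted in for large files. */
bool files_equal_mapped(const char *path1, const char *path2)
{
    int fd1 = open(path1, O_RDONLY);
    int fd2 = open(path2, O_RDONLY);
    struct stat st1, st2;
    bool equal = false;

    if (fd1 >= 0 && fd2 >= 0 &&
        fstat(fd1, &st1) == 0 && fstat(fd2, &st2) == 0 &&
        st1.st_size == st2.st_size) {
        if (st1.st_size == 0) {
            equal = true;                    /* two empty files match */
        } else {
            void *p1 = mmap(NULL, st1.st_size, PROT_READ, MAP_PRIVATE, fd1, 0);
            void *p2 = mmap(NULL, st2.st_size, PROT_READ, MAP_PRIVATE, fd2, 0);
            if (p1 != MAP_FAILED && p2 != MAP_FAILED)
                equal = (memcmp(p1, p2, st1.st_size) == 0);
            if (p1 != MAP_FAILED) munmap(p1, st1.st_size);
            if (p2 != MAP_FAILED) munmap(p2, st2.st_size);
        }
    }
    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    return equal;
}
```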
Posted on 2004-08-14 13:04:47 by Mikky
I don't understand what the CRC32/MD5 is for? :confused:

I remember that I have done it this way:

init (may be slow :P )
1. get files' sizes and compare
2. allocate 4 buffers (size of each buffer = MIN( 0.5*free_physical_mem ; size_of_largest_file ))
3. asynch read of 1st file(part) to 1st buffer
4. asynch read of 2nd file(part) to 2nd buffer
5. asynch read 1st file(part) to 3rd buffer
6. synch read 2nd file(part) to 4th buffer

loop (must be fast :) ):
7. byte-compare buffers: 1 and 2 (preferably with SSE or at least MMX, 3Dnow)
8. asynch read 1st file(part) to 1st buffer
9. asynch read 2nd file(part) to 2nd buffer
10. byte-compare buffers: 3 and 4 (preferably with SSE or at least MMX, 3Dnow)
11. asynch read 1st file(part) to 3rd buffer
12. synch read 2nd file(part) to 4th buffer
13: go to 7


this works well for me :alright:
Posted on 2004-08-14 13:10:08 by ti_mo_n
oops :P sorry - my memory suxx :P it was:

init (may be slow :P )
1. get files' sizes and compare
2. allocate 4 buffers (size of each buffer = MIN( 0.5*free_physical_mem ; size_of_largest_file ))
3. asynch read of 1st file(part) to 1st buffer
4. synch read of 2nd file(part) to 2nd buffer


loop (must be fast :) ):
5. asynch read 1st file(part) to 3rd buffer
6. synch read 2nd file(part) to 4th buffer
7. byte-compare buffers: 1 and 2 (preferably with SSE or at least MMX, 3Dnow)
8. asynch read 1st file(part) to 1st buffer
9. synch read 2nd file(part) to 2nd buffer
10. byte-compare buffers: 3 and 4 (preferably with SSE or at least MMX, 3Dnow)
11: go to 5



now it's ok :)
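Approximating the loop above in portable C, with a worker thread standing in for the async reads and invented names throughout (4 KB chunks just to keep the sketch small):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 4096   /* tiny for the sketch; size from free RAM in practice */

struct pair {
    FILE *f1, *f2;
    unsigned char b1[CHUNK], b2[CHUNK];
    size_t n1, n2;
};

/* Worker thread body: fill one buffer pair from both files. */
static void *read_pair(void *arg)
{
    struct pair *p = arg;
    p->n1 = fread(p->b1, 1, CHUNK, p->f1);
    p->n2 = fread(p->b2, 1, CHUNK, p->f2);
    return NULL;
}

/* Double buffering: while the main thread compares the current pair,
   a worker thread reads the next pair from both files. */
bool files_equal_dbuf(const char *path1, const char *path2)
{
    struct pair pairs[2];
    FILE *f1 = fopen(path1, "rb");
    FILE *f2 = fopen(path2, "rb");
    bool equal = (f1 && f2);
    int cur = 0;

    if (equal) {
        pairs[0].f1 = pairs[1].f1 = f1;
        pairs[0].f2 = pairs[1].f2 = f2;
        read_pair(&pairs[0]);                  /* prime the first pair */
        while (equal) {
            pthread_t reader;
            struct pair *p = &pairs[cur];
            bool more = (p->n1 == CHUNK);      /* full chunk: expect more */
            if (more)
                pthread_create(&reader, NULL, read_pair, &pairs[1 - cur]);
            if (p->n1 != p->n2 || memcmp(p->b1, p->b2, p->n1) != 0)
                equal = false;                 /* first difference found */
            if (more)
                pthread_join(reader, NULL);
            else
                break;
            cur = 1 - cur;
        }
    }
    if (f1) fclose(f1);
    if (f2) fclose(f2);
    return equal;
}
```

On Windows the join would be a WaitForSingleObject on the overlapped read's event, as ti_mo_n describes below.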
Posted on 2004-08-14 13:14:32 by ti_mo_n

Actually I think MMF in combination with large files is a pretty good fit. AFAIK they were created primarily for this reason.

Well, of course it will work if you MapViewOfFile+UnmapViewOfFile - but the maximum amount of a file you can map at once depends on the address space - and on 9x it is EVEN WORSE - since on 9x MMF are allocated in the shared memory area...


The thing is that I just need to find the first difference in the contents to return FALSE, thus it is likely that I won't need to check and load the whole file. So I think the question here has been narrowed down to which is better

Oki doki.
ASYNC should be the best, since you can do a lot of calculations "for free", in the time period where SYNC would block waiting for the data. It's perhaps a bit difficult using async in this case, since you're dealing with two files and need to have two blocks ready at once before you can do anything...
Posted on 2004-08-14 13:19:49 by f0dder
It's perhaps a bit difficult using async in this case, since you're dealing with two files and need to have two blocks ready at once before you can do anything...


that's why I'm using 4 buffers. Each has its own Event object which is set to signaled after the load is complete,

and just after I finish comparing 2 buffers, I wait for the appropriate events and continue by starting a new async file access and comparing these new buffers.
Posted on 2004-08-14 13:37:56 by ti_mo_n
Do something like a CRC but work a DWORD at a time (or if you want to get fancy, use MMX or SSE to do the comparisons).

Example HLL pseudo code:
hFile1 = get an open handle to the first file
hFile2 = get an open handle to the second file

dword chksum1 = 0
dword chksum2 = 0

while (!eof hFile1 && !eof hFile2 && chksum1 == chksum2)
    read in 4k buffer from hFile1
    read in 4k buffer from hFile2
    loop through each 4k buffer
        add each 4 bytes into that file's dword checksum

close file handles for both files

Something like this would be an OK method. Not the most efficient, but effective.

Relvinian
Posted on 2004-08-14 14:14:50 by Relvinian
I see no advantage to using a checksum/hash. If you calculate a checksum over one block, then over another block, and compare the checksums, why not just compare the blocks directly?
Posted on 2004-08-14 15:25:35 by Mbee
I agree with xlifewirex, comparing blocks is way faster than CRC-ing.
Posted on 2004-08-14 18:24:18 by Mikky
#1 - creating a hash can be "almost free" since you're waiting for a disk read to complete.

#2 - comparing blocks requires both blocks to be read into memory.


... perhaps two threads could create hashes in just the time it takes to read the two blocks... or something.
Posted on 2004-08-14 20:03:19 by f0dder