I recently, ah...tripped over :) the checksum algorithm microsoft uses for their PE files. To use this to set the correct checksum on a PE file, you need to first zero out the old checksum, then call this function on *the entire file*, and store the new checksum value in the header. Optimizations (Svin?) gladly accepted.
MicrosoftCheckSum PROC C uses esi, buf:dword, len:dword mov ecx, ; buffer length shr ecx, 1 ; we're summing WORDs, not bytes mov esi, ; esi is pointer to our buffer base xor edx, edx ; high word must be zero xor eax, eax ; EAX holds the checksum @@theloop: mov dx, ; get word add eax, edx ; add to sum ; add high and low parts of EAX together mov edx, eax and eax, 0FFFFh shr edx, 16 ; this clears the high part (needed in "add eax, edx" in the beginning of the loop) add eax, edx dec ecx jnz @@theloop ; add high and low parts of EAX together, restrain to 16 bits. mov edx, eax shr edx, 16 add eax, edx and eax, 0FFFFh ; and finally, add the length. Presto, that was it. add eax, ret MicrosoftCheckSum ENDP
Here is two procs, the first only needs to be called once, to initlize the Crc tables. The second works much faster because it uses previously definded varibles. If someone (The Svin) can optimize this one it should be much faster. I have tried, but I aint no expert. (This code can be assmebled directly into a .lib file)
This message was edited by George, on 4/24/2001 8:13:17 PM
.386 .model flat,stdcall .data? crctab DWORD 256 dup(?) .code ; Precomputes a 256*4 CRC array InitCrc32Tbl proc uses ebx esi edi lea edi,crctab + (255*4) xor edx,edx ;mov edx,-1 dec edx std CalcTblValues: mov eax,edx mov ebx,0edb88320h ; winzip polynominal push 8 pop ecx ReflectBitsCompute_Loop: shr eax,1 sbb esi,esi and esi,ebx xor eax,esi dec ecx jz ReflectBitsCompute_Loop stosd dec edx jns CalcTblValues cld ret InitCrc32Tbl endp ; Calculates a 32 Bit Cyclic Redundancy Code ; over a buffer with length SIZEOF(buffer), ; using a precomputed CRC table. InitCRC ; must be called *once* before. ; Speed: ~7 MB/s on a Intel PII 300 Crc32 proc uses esi edi lpData:DWORD,bLen:DWORD mov esi,lpData lea edi,crctab mov ecx,bLen xor eax,eax dec eax CalcCRC: xor edx,edx mov dl,byte ptr inc esi xor dl,al shr eax,8 xor eax,dword ptr dec ecx jnz CalcCRC not eax ret Crc32 endp end
Until you add the length at the end, the result at the end of each itteration of the loop must be 16 bits long: worst case: say word read in (dx) = 0FFFFh checksum (eax) = 0 eax = dx + eax = 0FFFFh hi + lo eax = 0FFFFh ***(next itteration)*** word read in = 0FFFFh (again) checksum = 0FFFFh eax = 0FFFFh + 0FFFFh = 1FFFEh hi + lo = 1 + 0FFFEh = 0FFFFh So the result is guaranteed to be 16bit! This means we can get rid of the
Well it saves a few clocks! It also gives the possibiltiy of just using the following code for the checksum:
; add high and low parts of EAX together, restrain to 16 bits. mov edx, eax shr edx, 16 add eax, edx and eax, 0FFFFh
Of course Svin will kill me for using ax, but I don't care :P And given I wrote this, it could be COMPLETELY wrong, so someone better check :P I am at work, so I cannot really do any real testing... :( Well thats enough of my jibber-jabbering, I AM THE WEAKEST LINK. GOODBYE! Mirno
mov ecx, len ; buffer length shr ecx, 1 ; we're summing WORDs, not bytes mov edx, buf ; edx is pointer to our buffer base xor eax, eax ; EAX holds the checksum @@: add ax, adc ax, 0 dec ecx jnz @B add eax,
This inner-loop is surely faster, but I wonder about my logic. Haven't test, at work, yadda, yadda, yadda...
This message was edited by bitRAKE, on 4/25/2001 4:53:09 PM
@@: adc ax, ;ESI is pointer to buffer adc dx, 0 dec ecx ;this doesn't effect the carry flag jnz @B adc ax, dx add eax,
As "dec" doesn't affect the carry flag, so you can do this:
Kind of using Bitrake's adc in one go, but delaying the carry until the next loop in order to remove one instruction (at the expense of the "clc" and changing the "add eax, len" to "adc eax, len" outside the loop). Again no testing here, work yadda shouldn't even be doing this yadda, not what I get paid for yadda :P *** Corrected code a little, plus removed one instruction *** As I made a REALLY STUPID mistake! *** DOH! *** Who would have thought that the code could be so optimised? Mirno This message was edited by Mirno, on 5/15/2001 7:15:12 AM
mov ecx, len ; buffer length mov edx, buf ; edx is pointer to our buffer base shr ecx, 1 ; we're summing WORDs, not bytes xor eax, eax ; EAX holds the checksum clc ; Clear the carry flag ready for later... ; The above have been re-arranged for pairing! @@: adc ax, dec ecx jnz @B adc eax, len
Mirno, nice work - even without the pay. :) I thought of that too, but I was worried about all the carries rolling over and didn't work out the math to prove they were equivalent - I just couldn't see it without thinking it was going to create problems in unique situations. I wish they'd just let me install a few things here on the harddrive. :) They watch my every move like a hawk! :( XOR clears the carry flag.
Hey guys, good job on optimizing the routine. Can't believe that I stared myself so blind at the original microsoft umm...source ;)...that I forgot to think about the algorithm itself, and failed to realize that the result would always be 16bit. I deserve a good old slapping. I've included some timings that I've run on my athlon700, checksumming MSHTML.DLL , the largest DLL file I could find in my windows\system directory. "calcCheckSum" is the C version of the algorithm (converted directly from the asm version I posted here), MicrosoftCheckSum is the version posted here, and number two is mirno's version. Ok, speed might not matter all that much on a specialized checksum algorithm like this (it's *not* suitable for anything but PE checksums, imho), but it's alway fun to see how fast one can go. I'm still pondering whether there's any "awfully smart" way to do it, but I really like mirno's version, it's so wonderfully short and elegant. Thumbs up. `calcCheckSum ': 34934ms (34.934sec) -- 2000 checks, 58 checks/second `MicrosoftCheckSum ': 29998ms (29.998sec) -- 2000 checks, 68 checks/second `MicrosoftCheckSum2': 21012ms (21.12sec) -- 2000 checks, 95 checks/second
I think you could sum DWORDS and then add the two WORD halves outside the loop, but this algo really should be optimized for size, IMHO.
I was thinking exactly the same thing :D ! You can modify the code pretty easily, just do the following:
I was also trying to think of a way to use "len" only once at the begining of the file, so it wouldn't need to mess around with the stack pointers (size & 1 or 2 clocks :) ). If I come up with a way to do that I'll post that too. Mirno
mov ecx, len ; buffer length shr ecx, 2 ; we're summing DWORDs, not bytes mov edx, buf ; edx is pointer to our buffer base xor eax, eax ; EAX holds the checksum @@: adc eax, dec ecx jnz @B mov ecx, eax and eax, 0FFFFh shr ecx, 16 add eax, ecx adc eax, len
Ok, if this is going in a proc, then this could be slightly shorter, as it manually deals with the stack, and never "creates the varaibles as locals" so to speak! The main reason this can be done is because the "length" variable is dealt with at the begining, and so we can simply pop "length" and "buffer" off the stack! This should save a couple of bytes. This is done at the expense of a couple of extra instructions to deal with this "pre addition". I haven't checked the code, and it may be wrong, but here it is anyway....
F0dder, could you post the code you used to test these algos? Or email it to me at "email@example.com". I would like to try to fiddle with my algo a little more, and I'm too lazy to write my own code :) Thanks Mirno
ChkSum PROTO length:DWORD, buffer:DWORD ChkSum PROC pop ecx ; Length pop edx ; Buffer mov eax, ecx shr ecx, 1 dec ecx clc @@: adc ax, dec ecx jnz @B adc ax, rcr ecx, 16 add eax, ecx ret ChkSum endp
Me too f0dder, bitRAKE@home.com. I haven't got around to coding an algo speed tester, but I should be getting an Athlon 1.333Ghz in a couple days. :) I can't afford any memory right now, though.
Mmmm, 1.33ghz :-]. Anyway, my speed testing is REALLY REALLY REALLY simple. Two time GetTickCount, and a large amount of loops. No use of those fancy performance MSRs or anything like that. My testing code is a real-world PE checksum fixer, with the speed testing worked in, so it's a bit lengthy to post here. I could write up a little asm thing to do it, though; it doesn't *really* matter what data you're testing the algo on :).
Hi f0dder, It is my faster variant....
buliaNaza proto ;USAGE: push lenBuffer ; ; push offset Buffer ; ; call buliaNaza ; buliaNaza proc pop eax ;u return address pop edx ;v a pointer to our buffer base pop ecx ;u buffer length push eax ;v return address push esi ;u save some registers push edi ;v push ebx ;u push ecx ;v save buffer length xor edi,edi ;u carry flag counter & clear carry flag (clc) mov esi,edx ;v esi is pointer to our buffer base shr ecx,2 ;u we're summing DWORDs, not bytes mov eax,0FFFF0000h ;v EAX holds the checksum bNaza: adc edi,0 ;u inc carry flag counter mov edx, ;v get dword mov ebx,edx ;u copy it in ebx and edx,0FFFF0000h ;v low word must be zero shl ebx,16 ;u low word must be zero add eax,edx ;v add to sum adc edi,0 ;u inc carry flag counter add eax,ebx ;v add to sum dec ecx ;u buffer length / 4 jne bNaza ;v if ecx > 0 loop again adc edi,0 ;u inc carry flag counter for last op pop edx ;v restore buffer length shr eax,16 ;u EAX holds the checksum add edi,edx ;v add buffer length add eax,edi ;u add to sum pop ebx ;v restore some registers pop edi ;u speed in cycles: 14 + 3*N pop esi ;v N = word ; 14 -> const. cycles ret ; 3*N -> 3 cycles per word buliaNaza endp ; tested for Pentium200 MMX and Buffer is ; aligned to 4
The method above can be converted to operate on DWORDs (indeed in a later post it was :P ). This uses a very small loop where the two "work" instructions pair, and should only therefore take 2 clocks, plus the length of the jnz... The main problem with all of these algos is more to do with memory speed, than any design of the algo (unless you do some REALLY stupid design :D ). Create a small app that opens a file, and computes the checksum on it. Then modify the app to do it several (hundred) times, then divide the clocks accordingly... This will remove the "disk" accessing speeds, but your still left with the caching problems! This will show that the disk speed cripples any "efficiency" you'll gain from code design, so the "best" use of the assembly in this case I would argue would be in terms of size. One point to note is that you could form the wrong answer due to your addition of the carries to the length, rather than adding carries to the 16bit checksum, adding the carry to that, then adding the length. For example:
But its nearly right! **** Extra! **** I also note that you shr ecx AFTER you use your xor edi, edi.. This can set the carry bit again (if the length is not divisible by 2), this is only technically a problem, as if the length is not divisible by 4, then the result is wrong anyway!!! Mirno This message was edited by Mirno, on 6/1/2001 7:03:04 AM
;eax = FFFF0000h ;edi = 1 ;edx = 1FFFFh (length) shr eax, 16 ; eax = 0FFFFh add edi, edx ; edi = 20000h add eax, edi ; eax = 2FFFFh ;Result cannot be this! Add any 16bit number to 1FFFFh and the ;maximum it can be is 2FFFEh ;In this case the result should be 20000h
Thanks Mirno for fast replay, Plz note I use the rules from the f0dder's post and the fact that the length of every .DLL or .EXE file is divisible by 4. Plz f0dder, could you test the progi and post the results? Thanks..