Here's a trick using xor that eliminates the xor edi,edi after the terminating condition test:
MinMax3 MACRO min:REQ, max:REQ
cmp min, max
jnc @f
mov edi, esi
xchg min,max
@@:
ENDM
Tx equ MinMax3
align 16
Test3 proc
; Read from memory
movzx eax, WORD PTR
movzx ebx, WORD PTR
movzx ecx, WORD PTR
movzx edx, WORD PTR
mov esi,1
mov edi,0
; Sort the values
Tx ah,al
Tx bl,ah
Tx bh,bl
Tx cl,bh
Tx ch,cl
Tx dl,ch
Tx dh,dl
xor edi,esi ; edi = 1 after a swap, so this clears it and sets ZF
jnz SORT_DONE ; edi was 0: no swap this pass, so we are done; otherwise edi is back to zero for the next pass
Tx ah,al
Tx bl,ah
Tx bh,bl
Tx cl,bh
Tx ch,cl
Tx dl,ch
xor edi,esi
jnz SORT_DONE
Tx ah,al
Tx bl,ah
Tx bh,bl
Tx cl,bh
Tx ch,cl
xor edi,esi
jnz SORT_DONE
Tx ah,al
Tx bl,ah
Tx bh,bl
Tx cl,bh
xor edi,esi
jnz SORT_DONE
Tx ah,al
Tx bl,ah
Tx bh,bl
xor edi,esi
jnz SORT_DONE
Tx ah,al
Tx bl,ah
Tx ah,al
SORT_DONE:
; Write to memory
shl ebx, 16
or ebx, eax
shl edx, 16
or edx, ecx
mov DWORD PTR , ebx
mov DWORD PTR , edx
ret
Test3 EndP
Could you not also simply do:
MinMax3 MACRO min:REQ, max:REQ
cmp min, max
jnc @f
mov edi, 1
xchg min,max
@@:
ENDM
and
...
xor edi, 1
jnz SORT_DONE
...
Would it be any slower/faster that way? I dislike using the index registers so the fewer the better in my opinion.
Spara
Yes, of course. The reason I did it this way is that mov edi,1 takes 5 bytes vs. 2 bytes for mov edi,esi. Similarly, xor edi,1 takes 3 bytes (the assembler uses the sign-extended 8-bit immediate form) vs. 2 for xor edi,esi. But even more important than the extra bytes is the fact that each adds an odd number of bytes, so the code that follows starts on an odd address rather than an even address until the next test. It usually doesn't make much difference, but I've found on this screwy Athlon of mine that code alignment sometimes makes a big difference. Time it both ways and make your choice. There's probably not much difference on a Pentium, and your xors are probably also faster on a Pentium than the xchg I used.
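For reference, here is a rough sketch of the encodings being compared. It assumes 32-bit code and an assembler that picks the shortest form for immediates, which MASM normally does. Note that xor edi,1 then gets the 3-byte sign-extended imm8 form, while mov has no short immediate form. A listing file (for MASM, ml /c /Fl) should confirm the actual bytes on your setup.

```asm
; 32-bit code; the bytes in the comments are the standard IA-32 encodings
mov edi, esi    ; 8B FE           2 bytes (some assemblers emit the equivalent 89 F7)
mov edi, 1      ; BF 01 00 00 00  5 bytes (mov reg,imm always carries a full imm32)
xor edi, esi    ; 33 FE           2 bytes (some assemblers emit the equivalent 31 F7)
xor edi, 1      ; 83 F7 01        3 bytes (sign-extended imm8 form)
```

So the immediate versions cost 3 extra bytes per mov and 1 extra byte per xor, an odd amount each time, which is what shifts the alignment of whatever follows.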
Originally, you wanted to save all the instruction cycles you could, so a little rearranging of your wrap-up saves 1-2 cycles on my CPU:
SORT_DONE:
; Write to memory
shl ebx, 16
shl edx, 16
or ebx, eax
or edx, ecx
mov DWORD PTR , ebx
mov DWORD PTR , edx
Thanks for the tips, JimG. I still don't know enough to tell how large one instruction's encoding is compared to another's, but maybe eventually I'll figure it out. I do know about instruction pairing, so I should have caught your second hint myself.
Just FYI, I ran the above sorting routine (the first one I posted) with the entire algorithm I'm working on 20,000,000 times with the worst case input and it took about half a second on my P4 2.4G. It's taken me 2 months to fully implement and test the code so it was quite a relief that the first time I tested it, it was already plenty fast.
Thanks for your help everybody.
Spara