Here's a trick using xor-

MinMax3 MACRO min:REQ, max:REQ
  cmp  min, max
  jnc  @f          ; min >= max (unsigned): already in order, skip
  mov  edi, esi    ; record that a swap happened (esi holds 1)
  xchg min,max
  @@:
ENDM

Tx equ MinMax3

align 16
Test3 proc
  ; Read from memory
  movzx    eax, WORD PTR
  movzx    ebx, WORD PTR
  movzx    ecx, WORD PTR
  movzx    edx, WORD PTR
  mov esi,1        ; constant used to set and test the swap flag
  mov edi,0        ; swap flag: no swap yet

  ; Sort the values
  Tx ah,al
  Tx bl,ah
  Tx bh,bl
  Tx cl,bh
  Tx ch,cl
  Tx dl,ch
  Tx dh,dl
  xor edi,esi ; result is zero only if a swap set edi equal to esi
  jnz SORT_DONE ; nonzero: no swap this pass, so we are done; otherwise the xor has already reset edi to zero
  Tx ah,al
  Tx bl,ah
  Tx bh,bl
  Tx cl,bh
  Tx ch,cl
  Tx dl,ch
  xor edi,esi
  jnz SORT_DONE
  Tx ah,al
  Tx bl,ah
  Tx bh,bl
  Tx cl,bh
  Tx ch,cl
  xor edi,esi
  jnz SORT_DONE
  Tx ah,al
  Tx bl,ah
  Tx bh,bl
  Tx cl,bh
  xor edi,esi
  jnz SORT_DONE
  Tx ah,al
  Tx bl,ah
  Tx bh,bl
  xor edi,esi
  jnz SORT_DONE
  Tx ah,al
  Tx bl,ah
  Tx ah,al

SORT_DONE:
  ; Write to memory
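  shl      ebx, 16     ; move bx into the high word
  or       ebx, eax    ; merge ax into the low word
  shl      edx, 16     ; move dx into the high word
  or       edx, ecx    ; merge cx into the low word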
  mov      DWORD PTR , ebx
  mov      DWORD PTR , edx

  ret
Test3 EndP


This eliminates the xor edi,edi that would otherwise follow each terminating-condition test.
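
For anyone who wants to step through the flag logic outside an assembler, here is a rough C model of the same idea. It is only a sketch: it uses loops instead of the unrolled passes, it tests the flag after every pass (the unrolled code skips the test before the last one), and the function name sort8_xor_flag and its return value (the number of passes run) are made up for illustration. v[0..7] stand in for al, ah, bl, bh, cl, ch, dl, dh.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the original routine: a looped C model
   of the 8-byte bubble sort with the xor early-exit test.
   Returns the number of passes that were executed. */
static int sort8_xor_flag(uint8_t v[8])
{
    uint32_t esi = 1;   /* constant "a swap happened" marker */
    uint32_t edi = 0;   /* swap flag, zero at the start of every pass */
    int passes = 0;

    for (int pass = 0; pass < 7; ++pass) {
        ++passes;
        for (int i = 0; i < 7 - pass; ++i) {
            if (v[i + 1] < v[i]) {      /* cmp / jnc @f */
                edi = esi;              /* mov edi, esi */
                uint8_t t = v[i];       /* xchg */
                v[i] = v[i + 1];
                v[i + 1] = t;
            }
        }
        edi ^= esi;                     /* xor edi, esi */
        if (edi != 0)                   /* jnz SORT_DONE: no swap this pass */
            break;
        /* A swap happened: the xor already left edi == 0 for the next pass. */
    }
    return passes;
}
```

The point to notice is that the single xor does two jobs: it sets the condition for the exit test, and when the exit is not taken it leaves edi at zero, so no separate reset is needed. Already sorted input should come back after one pass, and reversed input should need all seven.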
Posted on 2005-08-09 14:53:58 by JimG
Could you not also simply do:

MinMax3 MACRO min:REQ, max:REQ
  cmp  min, max
  jnc @f
  mov  edi, 1
  xchg min,max
  @@:
ENDM

and

...
xor edi, 1
jnz SORT_DONE
...

Would it be any slower/faster that way? I dislike using the index registers so the fewer the better in my opinion.

Spara
Posted on 2005-08-09 15:24:43 by Sparafusile
Yes, of course. The reason I did it this way is that mov edi,1 takes 5 bytes vs. 2 bytes for mov edi,esi. Similarly, xor edi,1 takes 3 bytes (the assembler uses the sign-extended 8-bit immediate form) vs. 2 for xor edi,esi. But even more important than the extra bytes is the fact that each change adds an odd number of bytes, so the code that follows can start on an odd address rather than an even address until the next test. It usually doesn't make much difference, but I've found on this screwy Athlon of mine that code alignment sometimes makes a big difference. Time it both ways and make your choice. There is probably not much difference on a Pentium, and your xor's are probably also faster on a Pentium than the xchg I used.
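
If it helps to see where the byte counts come from, the sketch below writes out one valid 32-bit encoding of each instruction as a C array so the sizes can be checked with sizeof. The array names and the helper extra_bytes_per_macro are made up for illustration, and an assembler may pick an equivalent encoding of the same length for the register-to-register forms. Note that xor edi,1 normally assembles to the 3-byte form with a sign-extended 8-bit immediate.

```c
#include <stddef.h>

/* One valid IA-32 encoding of each instruction in a 32-bit code segment.
   Names are hypothetical; the byte values follow the standard opcode map. */
static const unsigned char mov_edi_esi[] = { 0x8B, 0xFE };                   /* mov edi, esi */
static const unsigned char mov_edi_1[]   = { 0xBF, 0x01, 0x00, 0x00, 0x00 }; /* mov edi, 1   */
static const unsigned char xor_edi_esi[] = { 0x33, 0xFE };                   /* xor edi, esi */
static const unsigned char xor_edi_1[]   = { 0x83, 0xF7, 0x01 };             /* xor edi, 1 (imm8 form) */

/* Hypothetical helper: how many bytes each MinMax3 expansion grows by
   when mov edi,esi is replaced with mov edi,1. */
static size_t extra_bytes_per_macro(void)
{
    return sizeof mov_edi_1 - sizeof mov_edi_esi;
}

/* Hypothetical helper: growth of each exit test when xor edi,esi
   is replaced with xor edi,1. */
static size_t extra_bytes_per_test(void)
{
    return sizeof xor_edi_1 - sizeof xor_edi_esi;
}
```

Since the mov sits inside the macro, its extra bytes are paid once per expansion, while the xor's extra byte is paid once per exit test.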

Originally, you wanted to save all the instruction cycles you could, so a little rearranging of your wrapup saves 1-2 cycles on my cpu-

SORT_DONE:
  ; Write to memory
  shl      ebx, 16
  shl      edx, 16
  or       ebx, eax
  or       edx, ecx
  mov      DWORD PTR , ebx
  mov      DWORD PTR , edx

Posted on 2005-08-09 19:00:06 by JimG
Thanks for the tips, JimG. I still don't know enough to tell how large one instruction's encoding is compared to another's, but maybe eventually I'll figure it out. I do know about instruction pairing, so I should have caught your second hint myself.

Just FYI, I ran the above sorting routine (the first one I posted) as part of the entire algorithm I'm working on, 20,000,000 times with the worst-case input, and it took about half a second on my 2.4 GHz P4. It's taken me 2 months to fully implement and test the code, so it was quite a relief that it was already plenty fast the first time I tested it.

Thanks for your help everybody.

Spara
Posted on 2005-08-09 21:04:20 by Sparafusile