Most the time when someone asks a question about sorting, they want to know how to sort the largest number of elements in the shortest amount of time. My question, however, is: How would you sort a small predetermined number of elements in the shortest amount of time? For clarity's sake, let us assume we want to sort 8 elements that are within the range 00h-FFh and are stored in 8 consecutive bytes in memory. Is there a better way to sort those elements other than doing the tried-and-true compare & swap? Would it be easier to load the 8 elements into the 8 general purpose registers, sort them, and then write the elements back to memory? I have a few ideas, but I feel that I'm missing some fundamental idea that is holding me back. I appreciate all comments.

Spara
Posted on 2005-08-05 20:33:40 by Sparafusile
If you want to sort, you don't have a choice of making comparisons and swaping. However, once you have pushed an element to the end of the array as the largest (or smallest) of the elements, you don't have to compare it anymore.

Using your 8 element sort as an example, the first run would need 7 comparisons. The next one would need only 6 comparisons, the next one 5, etc. until you're down to only a single comparison for the last run.

Whether you perform those comparisons in memory or registers is the same process. However, if you have only 8 byte  elements, it would execute slightly faster to do the comparisons after loading the 8 8-bit registers, doing the comparisons and swaping within those registers, and storing the result back into memory. The drawback is that the amount of coding would be significantly larger.

Raymond
Posted on 2005-08-05 22:43:04 by Raymond
i think it can be done with 18 or fewer comparisons;  Here's an algorithm that would use 18 and about as many swaps in the worst case.

Array[1...8] to sort:
Compare item 1 /w 2, 3 /w 4, etc.  Swap when neecessary put make the smaller numbers in the odd elements.  Compare elements 1 & 3 and 5 & 7.  Compare the lower of each pair to find the lowest number.  Do the same with 2 & 4 and 6 & 8, extracting the largest number.  Remove these two elements from your array and put their original partners together such that the smaller element occupies the an index and the larger element occupies the even index.  At this point, you have six elements that you can conclude the following about:
Odd element 1 < even element 1
Odd element 2 < even element 2
Odd element 3 < even element 3
You also have the highest and loweste elements from the original list of 8 extracted.

Now, you can compare odd element (OE) 1 vs OE 2 and compare the smaller to OE 3.  Repeat for even elements (EE) to get the second largest number.  Remove these numbers and re-pair and compare again.  For example, if OE3 was the second smallest in the original list and EE2 was the second largest, you'd end up with this:
OE 1 < EE 1
OE 2 < EE3
Continuing in the fashion, i counted 18 comparisons - but it's late and i'm tired :D  *yawn*
Posted on 2005-08-06 00:09:52 by jademtech
With such a low count, is hard to do anything that doe not perform the sort quickly and conventional wisdom says you use an insertion sort for such a small number but lets face it, a bubble sort is fine for 8 items. If on the other hand you are repeating 8 items over and over again, you may need another strategy. If the output range is limited as yours is with 256 members, you could try a pidgeon hole sort where you make an array of 256 members set to zero and then test each byte at a time and increment the array location for the character position.

When you have scanned through the data, you then run through the array and display the results or save them based on if the character was used at all and then how many of each occurred.

This tecnique is much faster than compare/swap sorts but its use is very limited.
Posted on 2005-08-06 02:36:44 by hutch--
Here is what I've come up with so far:

? movzx? ? ?eax, WORD PTR
? movzx? ? ?ebx, WORD PTR
? movzx? ? ?ecx, WORD PTR
? movzx? ? ?edx, WORD PTR

? ; Sort the values
? MinMax? ? ah, al
? MinMax? ? bl, al
? MinMax? ? bh, al
? MinMax? ? cl, al
? MinMax? ? ch, al
? MinMax? ? dl, al
? MinMax? ? dh, al

? MinMax? ? bl, ah
? MinMax? ? bh, ah
? MinMax? ? cl, ah
? MinMax? ? ch, ah
? MinMax? ? dl, ah
? MinMax? ? dh, ah

? MinMax? ? bh, bl
? MinMax? ? cl, bl
? MinMax? ? ch, bl
? MinMax? ? dl, bl
? MinMax? ? dh, bl

? MinMax? ? cl, bh
? MinMax? ? ch, bh
? MinMax? ? dl, bh
? MinMax? ? dh, bh

? MinMax? ? ch, cl
? MinMax? ? dl, cl
? MinMax? ? dh, cl

? MinMax? ? dl, ch
? MinMax? ? dh, ch

? MinMax? ? dh, dl

? ; Write to memory
? shl? ? ? ?ebx, 16
? or? ? ? ? ebx, eax
? shl? ? ? ?edx, 16
? or? ? ? ? edx, ecx
? mov? ? ? ?DWORD PTR , ebx
? mov? ? ? ?DWORD PTR , edx

Where the MinMax macro places the smaller of the two operands in the first operand and the larger in the second:

Swap MACRO r1:REQ, r2:REQ
? xor? ?r1,r2
? xor? ?r2,r1
? xor? ?r1,r2
ENDM

MinMax MACRO min:REQ, max:REQ
? cmp? ?min, max
? jnc? ?@F
? Swap? min, max
? @@:
ENDM

The only way I can think of to improve this off hand is to preemptively end the sorting if none of the elements have moved position. Are there any obvious improvements that you can see?

In my case, I'm going to be running this code a maximum of about 1.7 million times so even a single extra clock cycle will be noticable. Also, my array elements are always unique so while hutch--'s idea is excellent, it might not be very useful in this case.

Spara
Posted on 2005-08-06 19:47:18 by Sparafusile
http://en.wikipedia.org/wiki/Heap_sort
Posted on 2005-08-07 03:05:25 by Criminal2
You could have a look at the radix sorting algorithm, but it might be a bit overkill for this *very* small number of elements. It's a pretty cute algorithm though, doesn't have a large additional memory requirement, and is "basically" branchless, except for loop counters.

Posted on 2005-08-07 12:31:08 by f0dder
Spara,

Unless I am reading your example wrong, what you are doing is just swapping a pair of values if one is bigger than the other and this will be faster than conventional sort methods. I would be inclined to test the XOR method you are using against using MOV to swap the values as it may be faster.
Posted on 2005-08-07 19:03:29 by hutch--
MOV ought to be faster than XOR since it doesn't have dependencies - but of course uses an additional register, or memory. RADIX sort can be fully inlined for small counts, but you do have the overhead of two 256-entry tables (count and index).
Posted on 2005-08-07 19:07:53 by f0dder
Another possability is to just use a branchless MinMax macro:

MinMax_BL MACRO min:REQ, max:REQ, tmp1:REQ, tmp2:REQ
? mov? ?tmp2, max
? sub? ?tmp2, min
? sbb? ?tmp1, tmp1
? and? ?tmp1, tmp2
? sub? ?max,? tmp1
ENDM

Of course, I'll need two additional free registers to do that. Maybe I'll try sorting the first two elements the "hard" way and then branchless MinMax for the last 6 elements. I could switch to mov swaps by sorting the first element the "hard" way as well. I'm afraid this may increase the number of times I need to write to memory wich could cancel out any benefit these methods garner anyway.

f0dder: I belive the Radix sort is what hutch-- suggested in his first post. Unfortunately, I know for sure that each of the elements I'm sorting will be unique so the additional overhead doesn't seem worth the trouble of implementing this algorithm.

Spara
Posted on 2005-08-07 20:23:24 by Sparafusile
MinMax MACRO min:REQ, max:REQ
cmp  min, max
jle  @F
xchg min, max  ; use xchg instruction directly, and then, MACRO of Swap is not needed
@@:
ENDM

Posted on 2005-08-07 20:45:34 by zara
Uses 24 calls to MinMax.  i'm sure this can be tweaked further.  Also, MinMax maybe could use a look-up table to eliminate the need for extra registers if the range of numbers were smaller and fit in cache; say, 00h through 10h.

MinMax al,ah ;al<ah
MinMax bl,bh ;bl<bh
MinMax cl,ch ;cl<ch
MinMax dl,dh ;dl<dh

MinMax al,bl ;al<(ah,bl<bh)
MinMax ah,bh ;al<(ah,bl)<bh
MinMax cl,dl ;cl<(ch,dl<dh)
Minmax ch,dh ;cl<(ch,dl)<dh

MinMax al,cl ;al<((ah,bl)<bh,cl,(ch,dl)<dh)
MinMax bh,dh ;al<(ah,bl,bh,cl,ch,dl)<dh

MinMax dl,ah ;al<(dl<ah,bl,bh,cl,ch)<dh
MinMax bl,bh ;al<(dl<ah,bl<bh,cl,ch)<dh
MinMax cl,ch ;al<(dl<ah,bl<bh,cl<ch)<dh

MinMax dl,bl ;al<(dl<(ah,bl,bh),cl<ch)<dh
MinMax dl,cl ;al<dl<(ah,bl,bh,cl,ch)<dh

MinMax ah,bh ;al<dl<(ah<bh,bl,cl,ch)<dh
MinMax cl,ch ;al<dl<(ah<bh,bl,cl<ch)<dh
MinMax bl,bh ;al<dl<((ah,bl)<bh,cl<ch)<dh
MinMax bh,ch ;al<dl<(ah,bl,cl,ch)<bh<dh

MinMax bl,ah ;al<dl<(bl<ah,cl,ch)<bh<dh
MinMax cl,ch ;al<dl<(bl<ah,cl<ch)<bh<dh
MinMax bl,cl ;al<dl<bl<(ah,cl<ch)<bh<dh
MinMax ah,ch ;al<dl<bl<(ah,cl)<ch<bh<dh
MinMax ah,cl ;al<dl<bl<ah<cl<ch<bh<dh
Posted on 2005-08-08 00:15:47 by jademtech
Spara-

Have you tested your routine?  When I tried it with

array  db 255, 156, 129, 99, 31, 14, 3, 0

The results were

99 31 14 3 0 255 156 129

I'll try to find where I went wrong, just wanted to know if it works for you.
Posted on 2005-08-08 10:27:41 by JimG
Got it!

You need to change your test after the comparison-

MinMax MACRO min:REQ, max:REQ
cmp  min, max
jnc @f
Swap  min, max
@@:
ENDM

jle doesn't work the way you want for bytes > 127.

Now for some timing tests ;)
Posted on 2005-08-08 11:44:19 by JimG
Thanks JimG. I've been programming a long time, but still have trouble choosing between the Jcc opcodes. Thanks for the fix.

Here's the same code except it will preemtively exit the sort if no elements were swapped it'll give a better "best-case" which improves the average runtime:

Swap MACRO r1:REQ, r2:REQ
? xor? ?r1,r2
? xor? ?r2,r1
? xor? ?r1,r2
ENDM

MinMax MACRO min:REQ, max:REQ
? cmp? ?min, max
? jnc? ?@F
? Swap? min, max
? mov? ?edi, 1
? @@:
ENDM

? movzx? ? ?eax, WORD PTR
? movzx? ? ?ebx, WORD PTR
? movzx? ? ?ecx, WORD PTR
? movzx? ? ?edx, WORD PTR

? ; Sort the values
? xor? ? ? ?edi, edi
? MinMax? ? ah, al
? MinMax? ? bl, al
? MinMax? ? bh, al
? MinMax? ? cl, al
? MinMax? ? ch, al
? MinMax? ? dl, al
? MinMax? ? dh, al
? cmp? ? ? ?edi, 0
? je? ? ? ? SORT_DONE

? xor? ? ? ?edi, edi
? MinMax? ? bl, ah
? MinMax? ? bh, ah
? MinMax? ? cl, ah
? MinMax? ? ch, ah
? MinMax? ? dl, ah
? MinMax? ? dh, ah
? cmp? ? ? ?edi, 0
? je? ? ? ? SORT_DONE

? .
? .
? .

? MinMax? ? dl, ch
? MinMax? ? dh, ch

? MinMax? ? dh, dl

SORT_DONE:
? ; Write to memory
? shl? ? ? ?ebx, 16
? or? ? ? ? ebx, eax
? shl? ? ? ?edx, 16
? or? ? ? ? edx, ecx
? mov? ? ? ?DWORD PTR , ebx
? mov? ? ? ?DWORD PTR , edx

Spara
Posted on 2005-08-08 11:47:36 by Sparafusile
It would seem that your latest algo would leave your unsorted if the first item happens to be smaller than the other 7 items. Or am I missing something.

Raymond
Posted on 2005-08-08 13:30:29 by Raymond
Clear edi before doing each round of sorting. The MinMax macro sets edi to 1 if a swap was performed. After each round of sorting is complete, if edi is 0 then preemtively end the sorting routine. Rinse, repeat...

Spara
Posted on 2005-08-08 13:46:17 by Sparafusile
As Raymond said, if the first element in register al is the lowest, no swaps take place, so edi remains zero, and you stop sorting, even though the rest of the elements are not in sort.  To use your idea for early termination, you would have to do a bubble-like sort, al to ah, ah to bl, bl to bh,...,dl to dh and then if no swap took place, you are done.
Posted on 2005-08-09 09:10:50 by JimG
I see. Silly me:

Swap MACRO r1:REQ, r2:REQ
xor  r1,r2
xor  r2,r1
xor  r1,r2
ENDM

MinMax MACRO min:REQ, max:REQ
cmp  min, max
jnc  @F
Swap  min, max
mov  edi, 1
@@:
ENDM

movzx    eax, WORD PTR
movzx    ebx, WORD PTR
movzx    ecx, WORD PTR
movzx    edx, WORD PTR

; Sort the values
xor      edi, edi
MinMax    dh, dl
MinMax    dl, ch
MinMax    ch, cl
MinMax    cl, bh
MinMax    bh, bl
MinMax    bl, ah
MinMax    ah, al
cmp      edi, 0
je        SORT_DONE

xor      edi, edi
MinMax    dh, dl
MinMax    dl, ch
MinMax    ch, cl
MinMax    cl, bh
MinMax    bh, bl
MinMax    bl, ah
cmp      edi, 0
je        SORT_DONE

xor      edi, edi
MinMax    dh, dl
MinMax    dl, ch
MinMax    ch, cl
MinMax    cl, bh
MinMax    bh, bl
cmp      edi, 0
je        SORT_DONE

.
.
.

MinMax    dh, dl
MinMax    dl, ch

MinMax    dh, dl

SORT_DONE:
; Write to memory
shl      ebx, 16
or        ebx, eax
shl      edx, 16
or        edx, ecx
mov      DWORD PTR , ebx
mov      DWORD PTR , edx

That should work then.

Spara
Posted on 2005-08-09 09:23:05 by Sparafusile
This may NOT work then.

As a start, if AH is smaller than AL, you will swap it into AL with the first MinMax. But that could be higher than some (or all) of the other bytes in BX, CX, or DX and you never compare AL anymore.

Raymond
Posted on 2005-08-09 11:10:35 by Raymond