Hello all,
I'm looking for an algo for a problem I only know how to solve in a brute-force manner.
The problem is this: I have an array of word sized values, and I want to be able to determine a unique number that is not in this array.
I have come up with two solutions, both of which work but aren't especially clever:

Solution 1
------------
Let x=1
Check entire list if x is in the list
If it isn't we're done
Else, increment x and check if the new value of x is in the list
etc.

Solution 2
-------------
Set aside 8kb (65536 bits) an initialize it to zero. Go through each value in the list and set the corresponding bit in the buffer to 1. Once done, scan for a clear bit (indicating that the index is not in use).

As you can see, both solutions are pretty poor. But I'm stumped. I thought about looping over the list and keeping a variable that represents the current max or min, but both 0 and 65535 occur frequently in the list. Also, the list is not guaranteed to be sorted in any meaningful way.

Any suggestions are appreciated
Thanks
--Chorus
Posted on 2003-05-12 21:40:29 by chorus
If the array is sorted, scan it from the beginning for numbers that skip. ie, scan until the current number being tested is at least 2 more than the previous number. If it is, pick a number between them and you have your missing number.

If the array isn't sorted, or you don't want to sort it, go with the latter method.
Posted on 2003-05-12 22:34:06 by iblis
Yeah, you would have to sort the array if you wanted any kind of speed at all. You could then use a search algorithm that made sense... say check the middle of the array, if your number is greater check the middle of the upper half, if your number is greater etc... the search would be pretty quick to zero in on a match.
Posted on 2003-05-12 22:44:47 by donkey
Okay, what are the possible ranges of "X". Is the array sorted?

If it is sorted, then scan in a linear fashion to the first ___space___ ... ie. arrayvalue - arrayvalue >1
Choose array value + 1

Also, check for an available value at the start of the list.
Posted on 2003-05-12 22:45:06 by V Coder
Thanks for the replies guys :)

The array is not guaranteed to be sorted. I wish it were, it would be pretty trivial then...

Each index can be any value in the range [0,65535], however, the array will have << 65536 entries, so there *will* be a free index available somewhere.

My first instinct was to do something like this (pseudocode):

var Array[1..n]

var x = 0
for (i = 1 to n) { x = max (x,Array)};

i.e., get x to be the maximum index stored in the array. Then our unused index is x + 1. This can be done pretty efficiently. It trips up though if x = 65535 is in the list though (which is a reserved value and shows up often) .


Thanks again for the ideas though

--Chorus
Posted on 2003-05-12 22:53:42 by chorus
If it is very important to have this unique number then either maintain a sorted array, or use a tree - I prefer using a tree structure. Touching every entry in the array is a sure show stopper for finding a unique number with any speed - the only solution is to change the data structure.
Posted on 2003-05-12 23:04:09 by bitRAKE
Why don't you make it sorted then?

As data is entered, sort it...

How about this? Create a linked list kind of thing, with forward and backward indices allowing traversal through the sorted list... As data is added, insert it into the correct position in the list. As data is deleted adjust the forward and backward links.

This way the data is not interfered with and you have the best of both worlds... Especially when you want to do garbage collection.

Oops. I think I am going beyond myself... I just looked and reminded myself it was an array of word size values...

Well, have a separate buffer, and create a heap structure - this way you automatically have an idea of the next available ... Heaps are difficult, but quicker than linear search... I have information on heaps in a box somewhere...

Easiest solution. Use a separate buffer, quicksort the elements and then search through as indicated in previous posts.
Posted on 2003-05-12 23:08:09 by V Coder
This is a great chance to learn what a Square Array data structure is!
Posted on 2003-05-12 23:17:17 by bitRAKE
Dunno what a square array data structure is... I'll look into it, though...

Anyways, the dataset is a little more complicated than a simple array of words, though the details are really inconsequential. But here's what's going on anyways.

I have a DB of entries of varying size. Each entry in the DB has an id and a parent. Every item's parent is the id of another item in the DB, and so we have a tree structure without storing much about the tree. The benefits of this are: very fast moving of branches (just change the parent of your top most node), and it cuts back on transient and redundant data. The file is also extremely easy to manage.

So every node is supposed to have a unique ID so that child nodes and leaves will graft on appropriately. When I go to add a new node, I want to have an ID that isn't already in use. Hence my problem. Of course, I knew that this would be the case, but adding items isn't a major priority b/c it happens infrequently compared to other operations. Regardless, it'd be nice to have a simple, clean way of getting the unique ID.

Sorting isn't really an option b/c I can probably do the "bitmask" approach faster. I'm sure it's an O(N) operation which is bound to be better than any sort.

My best idea so far actually is a variation of the max solution. The problem being signed values. So I figure add 32768 to each value, calculate the max on the resulting unsigned values, and then shift the result back by subtracting 32768. I think that will work reasonably well and I'll only have to go through the array once

--Chorus
Posted on 2003-05-12 23:30:18 by chorus
When you compare two 16 bit numbers, they are not automatically treated as signed.

Several flags are set, which allow you to determine if one is greater if they are considered signed numbers, or if one are above (unsigned numbers).

JA jump if above (unsigned) carry flag=0 [, zero flag=0, therefore not equal]. equivalent to JNBE
JAE jump if above or equal (unsigned) carry flag=0. equivalent to JNC, JNB
JB jump if below (unsigned) carry flag=1. equivalent to JC, JNAE
JBE jump if below or equal (unsigned) carry flag=1 or zero flag =1. equivalent to JNA

JG jump if greater (signed) sign flag= overflow flag or zero flag=0. equivalent to JNLE
JGE jump if greater than or equal to (signed) sign flag=overflow flag. equivalent to JNL
JL jump if less than (signed) sign flag not = overflow flag. equivalent to JNGE
JLE jump if less than or equal to (signed) sign flag not = overflow flag or zero flag=1. equivalent to JNG

No need to add 32768 or anything...
Posted on 2003-05-13 00:05:40 by V Coder

This is a great chance to learn what a Square Array data structure is!

Hey, bitRAKE, don't leave us hanging... Give us a hint.
Posted on 2003-05-13 00:07:09 by V Coder
DEFINITION: (Square Array.) A square array of order n is a two dimensional array of n*n storage elements indexed by row and column. Each possible combination of row and column values will select a storage position which can hold a single symbol or value. http://www.ciphersbyritter.com/BBM.HTM

But what is meant by Square Array Data Structure - a particular application of a square array?

Or is it: if he has n elements in the convoluted array, make it n^2?
Posted on 2003-05-13 00:16:53 by V Coder
V Coder,
I understand how comparisons work... the reason I want to go unsigned is because you can code max (a,b) for unsigned without using a jump and the associated penalties for making conditional jumps (Particularly those that fail prediction). I don't think you can do this with signed values, although honestly I haven't looked into it, yet. Hence my initial thought to "shift" everything up by 32768

Thanks though

--Chorus
Posted on 2003-05-13 07:40:29 by chorus
I don't know the max (a, b) instruction - Is that ASM?

I wonder if the penalties for branching are as severe in your case. If it is a program to do significant number crunching in parallel, much optimization of the loop is needed. If on the other hand, the operator makes a choice and the program enters information etc, programmining it fully in Win32 assembler instead of HLL or mixed will probably have greater speed benefit than avoiding branches... Let's say the penaly is 1000 clock cycles (greatly exaggerating), then that is 1 millionth of a second... only matters much if you have several hundred thousand missed branches before responding to the user.

But there is a quick algorithm to choose max (a,b) without branching that I saw somewhere. I'll look for it.
Posted on 2003-05-13 14:51:39 by V Coder
V Coder,
for max (a,b) without jump you can look here in
this thread.

To be honest, the speed is not truly that important. Currently, I'm using the 1st solution in my OP. It works fine, and you'd never know it was a horrible approach. I'm mostly curious if there is a really good way of doing this. People around here have a way of pulling great algos out of nowhere ;)

--Chorus
Posted on 2003-05-13 17:18:20 by chorus
Dumb idea:

why don't you set a second array of 65536 bytes ?

At the beginning, you clear the array, then tag all the used values, then set an InternalPointer to 0.
When you need to allocate a new value, you check secondarray
if it's not 0, ++InternalPointer and loop
Otherwise, set SecondArray, and you have your value.

Here is the code in C pseudocode:
Init()
{
memset(SecondArray,0,65536);
for(i=0;i<sizeof(Array);++i) SecondArray.index]=1;
InternalPointer=0;
}

GetFirstFree()
{
while(SecondArray) ++InternalPointer;
SecondArray=1;
return InternalPointer++;
}

JC
Posted on 2003-05-13 17:24:56 by MCoder
Chorus,

Just keep a list of unused IDs. In the interest of memory, you could add a range field to the list. When you add an element to the DB, modify the ID list.

So start out with an ID list that represents the number of adjacent unused IDs. (a full ID list would be a single element with range values 0 - 65535) When you add an element to the DB, getting a free ID # will take no time at all. When you delete an element from the DB, mark the ID as free and join it with any adjacent free IDs.

Worst case scenario is if you delete every other element from the DB and the ID list gets fragmented, but it will still never be much bigger than your DB.

For example, if you had 4 items in your DB with the IDs #3, #40, #71 and #72, your ID list would look something like:

[0 - 2]-->[4 - 39]-->[41 - 70]-->[73 - 65535]
Posted on 2003-05-13 17:26:48 by iblis
iblis,

If you use a list to keep the ranges, the worst-case is O(n^2) (since you may have to traverse the whole list to insert each element if they are sorted). If you use a heap to store the ranges you end up with O(n log n), which is better -- but you might as well sort the list. Depending on the distribution of values, it might be better to keep ranges of present values instead.

I think the bit-array is the obvious linear-time solution which does not use too much memory.

Another silly idea would be to randomize the search .. if there are substantially less than 64k values, it is not very likely to choose one of them (i.e. you should not have to try many values before you find one that is not present) .. but the worst-case is of course horrible :grin:.

It all depends a lot on how many values there actually are.
Posted on 2003-05-14 03:24:29 by Jibz
Maybe you didn't understand the idea.

Unless you are selective about which ID number you give to an item, getting an unused ID from the list will take O(1). If you go for either the smallest unused ID or the largest, it is simply a matter of modifying the beginning or the end of the list to reflect the changes. It's only when you delete an item from the database, thereby freeing its ID number, that you might have to traverse the whole ID list to find out where to reinsert the number, and if possible, coalesce it with adjacent ranges, which will take about O(n). I suppose the question is how often will items be deleted from the DB.

But since there will only be 65536 ID numbers, almost any of the previously mentioned ideas would work without much noticeable performance penalties.
Posted on 2003-05-14 15:14:29 by iblis
:stupid:

I didn't read the posting about the DB stuff, so I was still thinking about the basic problem of finding a value not present in an array :grin:
Posted on 2003-05-15 02:44:26 by Jibz