While developing h2incn I needed a way to insert and search for defined values quickly. Using my recently developed FNV-1 hash algorithm I've created a hash map (basically an array of pointers indexed by the hash value) with collisions now handled by a Binary Tree.

The files can be found at:

http://sourceforge.net/projects/h2incn/

The Binary Tree assembles in both 32-bit and 64-bit Linux and Windows. It is not a balanced tree nor does it permit duplicates but it's perfect for this application. Node deletion is not completed yet but will be in the next few days. All assembly files use Nasm syntax, of course ;)

Download the files bintree.asm and bintree.inc for the implementation. Download hashmap.c and hashmap.h which shows it's usage.

This project relates well with the NASMX project. And yes, you can now download the tarball and compile and execute h2incn as it's slightly more stable although not feature complete yet. I'm still trucking along with development...

Is it worthwhile using a binary tree instead of a simple linked-list data structure? The purpose of the H(x) is to reduce collisions as much as possible; how many collisions are you getting?

My goal was simply to provide a Nasm syntax assembly implementation of a Binary Tree for myself and anybody else that might benefit from it. It is implemented using recursion but is quite fast. I'm not overly concerned with shaving micro-seconds off execution time at this point.

With regard to collisions I've noticed that the hash algorithm I'm currently using is not providing the proper spread across the hash map. I have created a 32K pointer array hash map with each slot potentially holding a root pointer to a binary tree. After testing I noticed that collisions are occuring almost 50% of the time, thus a file containing 1000 defines has approximately 500 collisions occuring. A 32K map provides plenty of potential space and collisions at such a low watermark irks my 'taters a bit! :shock:

So I'm pondering either the possibility that a) The hash algorithm is truly not designed for 16-bit hashs; b) My conversion from 32-bit to 16-bit is flawed, or c) I need a better algorithm more suitable for 16-bit hashing.

I originally ruled out CRC16 as too heavy for my requirements but may reconsider. If anyone has suggestions regarding any of the above I'd appreciate your input.

No wonder your hashing algorithm produces 50 percent collisions when trying to map 32K unique entries using 16 bit resolution - notice that you only have 64K possible one-to-one mappings!

I don't think *any* hashing algorithm will get around the fact that your keyspace is inadequate for the number of entries you are trying to map. Is there some reason that you insist on 16 bit keys? If you had stuck with the original 32bit keyspace you would have a 1 in 4.3 billion chance of colliding (regardless of the complexity of the hashing algo).. is the cost of storing an extra 2 bytes per entry really that bad?

I don't think *any* hashing algorithm will get around the fact that your keyspace is inadequate for the number of entries you are trying to map. Is there some reason that you insist on 16 bit keys? If you had stuck with the original 32bit keyspace you would have a 1 in 4.3 billion chance of colliding (regardless of the complexity of the hashing algo).. is the cost of storing an extra 2 bytes per entry really that bad?

No wonder your hashing algorithm produces 50 percent collisions when trying to map 32K unique entries using 16 bit resolution - notice that you only have 64K possible one-to-one mappings!

Yes, which is more than enough for my purposes. However, I think you misunderstood the problem domain (or I didn't describe it in enough detail). For parsing a C header I'm not expecting to find 32 thousand defines and typedefs in a header file to hash even when recursing included headers. However, the system itself is quite capable of maintaining billions of tokens (restricted only by available memory). Each define is initially hashed and inserted into the hashmap for future lookup. The hash map itself is preallocated to allow 32K ptrs to root tree tokens. The purpose of the Binary Tree is to handle any hash collisions occurring from use of the hash algorithm. Using a binary tree for searching previously inserted entries will always be faster than iterating through a linked list, unless of course the entries are added in sequential order in which case the performance would be identical. Again this should not be occurring in any "normal" C header file.

The concept is as follows:

Hash Map Root Tree Pointer

0x0000 0x00000000 TOKEN_C

0x0001 0x00000000 /

0x0002 ---> 0x004A1234 ---> TOKEN_B

0x0003 \

. TOKEN_A

.

.

0x7FFD 0x00000000

0x7FFE 0x00000000

0x7FFF 0x00000000

The crux of the issue is that the hashing algorithm, when run against a file containing only 1000 defines, was generating hashes that collide almost 50% of the time. I would expect that only a 3% utilization of available hash space would not be incurring such a high rate of collisions, even when using "only" 15 bit hashes. Would you consider this an unrealistic expectation?

No, the probability for collisions grows proportionally to the consumption of the keyspace, so at 100% utility we have 100% chance of a collision .. and regardless of the hashing algo employed, we would expect around 3% collisions when the keyspace has been 3% consumed... your expectations are quite reasonable.

It must be said that a more complex hashing algo simply provides a better distribution of the hashes over the keyspace - it has no effect on the PROBABILITY of a collision, given RANDOM input. However if there is any linearity between the input data and the hashing algorithm, this will indeed increase the probability.

There must be something seriously wrong with your hash implementation - even an additive CRC (which is PURELY linear) would be less prone to collisions than this - I would take a CLOSE look at the implementation, line by line. If you can't solve it, i would point you toward any one of the pseudo-RNG algorithms as a good way to delineate (or even replace!) your hashing algo.

I actually did quite a bit of research into probability distribution analyses of RNG's a couple of years back, and developed a custom RNG based on MTB and ParkMiller for the purpose of encrypting data. Basically I needed to create an algo which generated a very well-distributed hash based on input data which I then used in subsequent steps as a cypher key - it would be suitable for this problem domain.

It must be said that a more complex hashing algo simply provides a better distribution of the hashes over the keyspace - it has no effect on the PROBABILITY of a collision, given RANDOM input. However if there is any linearity between the input data and the hashing algorithm, this will indeed increase the probability.

There must be something seriously wrong with your hash implementation - even an additive CRC (which is PURELY linear) would be less prone to collisions than this - I would take a CLOSE look at the implementation, line by line. If you can't solve it, i would point you toward any one of the pseudo-RNG algorithms as a good way to delineate (or even replace!) your hashing algo.

I actually did quite a bit of research into probability distribution analyses of RNG's a couple of years back, and developed a custom RNG based on MTB and ParkMiller for the purpose of encrypting data. Basically I needed to create an algo which generated a very well-distributed hash based on input data which I then used in subsequent steps as a cypher key - it would be suitable for this problem domain.

Is it worthwhile using a binary tree instead of a simple linked-list data structure? The purpose of the H(x) is to reduce collisions as much as possible; how many collisions are you getting?

You won't even need that. You can also use what is known as a 'probing hashtable'.

For example, I use a linear probing hashtable in my 3D Engine for various caches. It generates a table index from the hash, and then checks if that spot is already occupied. If it is, it will increase the index by 1 and repeat. It will continue to search linearly until it finds an empty spot. Retrieving an item works the same way. You search until you either find the item, or find an empty spot in the table. For deletion, you use a special deleted pointer value, so that searching will not break.

A variation on that is the quadratic probing table. Instead of increasing the pointer by 1 everytime, it will take index*index and generate a new index from that.

I prefer to use the probing approach over linked lists. It has slightly less overhead.

We also have a version of this in ObjAsm, I call it the "holey array method" - it's not the best solution for all cases either but yeah, in this situation it probably deserves visiting.

Pure linked lists are crap for sorted arrays, we appreciate this... and so too for BINARY trees, where we can clearly separate the subspaces we should aim higher than 2... why should we NOT apply , say, QuadTree, or any other NARY tree approach to hashing, if we can?

Pure linked lists are crap for sorted arrays, we appreciate this... and so too for BINARY trees, where we can clearly separate the subspaces we should aim higher than 2... why should we NOT apply , say, QuadTree, or any other NARY tree approach to hashing, if we can?

Scali, I'm aware of the probing approach - but what do you do if all the slots are filled? Also, the probing can end up pretty slow when you have an almost-full table.

As it often is, the more you know about your data the better algorithms you can choose - probing isn't bad if you have good H(x) that gives a sparsely populated table.

As it often is, the more you know about your data the better algorithms you can choose - probing isn't bad if you have good H(x) that gives a sparsely populated table.

If you can afford a single bit you could indicate if the slot overflowed once*, and hence, even if the array is full you would be able to stop before N steps. On non large-address-aware programs you could use the highest bit of the pointer to string for this.

*The bit could be cleared if when deleting you scan the table for the last key whose hash value matches the released slot and move the key to it or clear the bit if none was found.

PS: Scali, your algo use circular scan or just scans up to bottom?

*The bit could be cleared if when deleting you scan the table for the last key whose hash value matches the released slot and move the key to it or clear the bit if none was found.

PS: Scali, your algo use circular scan or just scans up to bottom?

Scali, I'm aware of the probing approach - but what do you do if all the slots are filled? Also, the probing can end up pretty slow when you have an almost-full table.

Well, you almost provided the answer yourself already: Don't let the table become almost-full.

Performance of a hashtable will always degrade as it gets fuller. A linked list is no different. You'll lose the O(1) advantage that caused you to use a hashtable in the first place.

At each insert, I check the number of entries, and double the size of the table when I insert the N/2nd element, where N is the size of the table. I also try to choose an initial table size that avoids resizing, whenever possible.

PS: Scali, your algo use circular scan or just scans up to bottom?

It starts at the index generated by the hash, then wraps around if it passes the end of the table. So that would be circular. Since the table can never be full (see post above), it will always find an empty spot eventually.

The collision problem is much worse than one might think. According to the Birthday Paradox, a hash table only 16% full will have a 99% chance of a collision. A hash table 55% full will have a 99.9999999999999999999999999998% chance of a collision. This is assuming you're using a perfect hashing method.

Wolfshook, but what is the change of that happening many times? (i.e. after finding a pair/collision, what are the odds of finding, say, log(N) pairs that collide in the whole set)

I'm expecting it to be less but I'm not sure how to calculate, though...

I'm expecting it to be less but I'm not sure how to calculate, though...

Wolfshook, but what is the change of that happening many times? (i.e. after finding a pair/collision, what are the odds of finding, say, log(N) pairs that collide in the whole set)

I'm expecting it to be less but I'm not sure how to calculate, though...

I see what you're getting at.

You want to look at Collision Counting. But I gotta be honest, math isn't my strong point.

50% chance of a collision when hash table is 6% utilised.

50% chance of 2-way collision when hash table is 24% utilised (not including entries in the collision resolution subset.)

50% chance of 3-way collision when hash table is 85% utilised.

I agree with what you're alluding to. I was just getting at that collision resolution

*does*need to be efficient.

Now, apparently, it should be something lightweight and cheap. Maybe a BSR is bit heavy handed, depending on what you're working on? Although linked lists are susceptible to "Algorithmic Complexity" denial of service attacks.

There is an update available to the Binary Tree routines in file bintree.asm available at:

http://sourceforge.net/projects/h2incn/

A bug fix to the binarytree_insert_node function fixes a parent-child pointer issue and I've added the code for binarytree_delete_node() for 32 and 64-bit platforms. So basically that particular subpackage is now complete. You can read the usage and assembling instructions as they are documented within the source file itself.

Don't be alarmed by the size of the actual source file itself (56KB). Although there are over 2000 lines of code and comments remember that there are 3 different implementations within the same file: one for 32-bit using the C calling convention applicable to both Linux and Windows, one for Linux 64-bit fastcall, and one for Windows 64-bit fastcall. Only the actual code for the target platform will be assembled. For instance, the actual binary size of the assembled routines is only 877 bytes for a 32bit implementation using Nasm with optimizations enabled.

There are a few optimizations that can be made to the package such as iteration rather than recursion or shrinking the size of the _bst_node_t structure knowing that the key/ value pairs immediately follow it but I leave those exercises to the reader. The source code itself provides a working x86 and x64 implementation compatible with Linux and Windows that is well documented, fast, and provides students of assembly an excellent example of programming without using more advanced techniques. Of course, for a few pints of Guinness I can be persuaded to write a more optimized version ;)

One technique I do use that you may find appealing is the single allocation of memory that contains both the node structure itself as well as the variable length data that a node may contain. This enables the entire node to be fetched into the CPU cache during memory reads and, provided the lengths of the key/value pair are small enough, reduces thrashing.

http://sourceforge.net/projects/h2incn/

A bug fix to the binarytree_insert_node function fixes a parent-child pointer issue and I've added the code for binarytree_delete_node() for 32 and 64-bit platforms. So basically that particular subpackage is now complete. You can read the usage and assembling instructions as they are documented within the source file itself.

Don't be alarmed by the size of the actual source file itself (56KB). Although there are over 2000 lines of code and comments remember that there are 3 different implementations within the same file: one for 32-bit using the C calling convention applicable to both Linux and Windows, one for Linux 64-bit fastcall, and one for Windows 64-bit fastcall. Only the actual code for the target platform will be assembled. For instance, the actual binary size of the assembled routines is only 877 bytes for a 32bit implementation using Nasm with optimizations enabled.

There are a few optimizations that can be made to the package such as iteration rather than recursion or shrinking the size of the _bst_node_t structure knowing that the key/ value pairs immediately follow it but I leave those exercises to the reader. The source code itself provides a working x86 and x64 implementation compatible with Linux and Windows that is well documented, fast, and provides students of assembly an excellent example of programming without using more advanced techniques. Of course, for a few pints of Guinness I can be persuaded to write a more optimized version ;)

One technique I do use that you may find appealing is the single allocation of memory that contains both the node structure itself as well as the variable length data that a node may contain. This enables the entire node to be fetched into the CPU cache during memory reads and, provided the lengths of the key/value pair are small enough, reduces thrashing.