Randy Gaul's Game Programming Blog: Hash Tables: Introduction

I've just set up my first hash table after a bit of studying; it's time to write about it! Docendo discimus ~

Wikipedia has a nice definition of what a hash table is: LINK.

In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

The idea of setting up a hash table is that by using a hash function you can index an array without actually knowing what the index is.

For example say you have an array of structures that look like this:

typedef struct _ASSETFILE

{

char *fileName_;
assetFile *asset_;
} ASSETFILE;

This array holds structures that represent asset files for some sort of project (images, sounds, data, etc). Imagine you want anyone to be able to access any asset file in this array by using a function like so: FindAsset( "filename" );. How would you go about creating this FindAsset function? The easiest way would be to use a linear search starting at the beginning of the array. Simply compare the string of the parameter passed to the FindAsset function with the string within the ASSETFILE structure until a match is found.

This can be a slow search. Perhaps the next optimization would be to order the entire list alphabetically from lowest to highest. This would allow a binary search through the array by using strcmp from the standard library. There wouldn't be any faster search method than a binary search in this scenario. Inserting into an ordered list would require a binary search as well.

However by using a hash function you can achieve even faster searches. By using a hash function you can convert the filename provided to the FindAsset function directly into an index to index the array of ASSETFILE structs. The filename is what is called a key. By hashing a key an index to the hash table is generated, and the corresponding ASSETFILE struct can be accessed very quickly. It's important to note that the order in which the entries are stored into slots is random. A good hash function will randomly map keys to indices with an even distribution (avoids clustering).

The relationship between keys and indices is not a one to one ratio however. Keys to indices is of a one-to-many ratio -that is multiple keys can resolve to the same address. Each key must be unique, however, in order for the hash table to function properly.

In short, using a hash table allows for indexing of an array by some other type of data than simply an integer. There are many different hash functions available on the internet, and a simple google search will yield many results.

I've set up a demonstration program to show the usage of a hash function here. I handle collisions with something called linear probing. A collision is whenever a key maps to a location in the table that already contains an entry (two or more keys are resolving to the same slot). A collision can be resolved in many different ways, though the way I chose in my sample program was linear probing. Linear probing is simply moving down one slot in the table until a free slot is found. The entry to insert is then placed. Each time the index is incremented (or decremented, direction doesn't matter as long as your are consistent) to find an empty slot it is called a probe. This means that for each lookup you'll need to compare the key to the entry being looked at and check for a match, as when you get a hashed index it doesn't necessarily mean that the key you have had been placed into that entry; a previous key may have inserted a value there already. When attempting a lookup first check the slot the hash resolves to. If no match is found, check the next slot (using the same direction as before) until you find a key match, or an empty slot. If you find an empty slot then that means your lookup concluded that no entry matches in your table.

Linear probing forms clusters of entries, however, since you resolve collision by placing entries into the next available slot. Each entry placed into the table increases the chances of a collision, and as the table fills up a lot of time ends up being spent on probing looking for an empty slot. There are ways of breaking up these clusters like double hashing, or by having each slot point to a linked list of entries. However with a good hash function clusters can be kept to a minimum as long as the table doesn't get too full.

Load factor is a term that represents the total number of current entries divided by the table size. Once a hash table has a load factor of .7 or so linear probing starts getting dramatically slow.

Image from Wikipedia. As you can see once the load factor of the table hits around .8 the time spent probing increases dramatically.

In order to retrieve an entry from the hash table (with linear probing, as in my sample program) all you'd have to do is take your key and pass it to your hash function. Once this is done you'll have the index to start your search. Check to see if the key matches the key within the index. If it does not match then increment (or decrement, whatever direction you probe in you must do here for consistency) the index by one and compare again. Once a match is found the lookup function has done its job! If no match is found and the search function runs into an empty slot, that means that no match had been found. Here's the lookup function from the sample program:

char *TableLookup( const char *Key, char *table[] )
{
unsigned index = UHashMod( Key, TABLESIZE );
unsigned indexStart = index;

// Only search through clusters, don't hop over empty entries
while(!IsEmpty( table, index ))
{
// Break from loop if found matching entry
if(table[index] != DELETED)
{
if(strcmp(table[index], Key) == 0)
break;
}

++index;

// wraparound to beginning of table
if(index == TABLESIZE)
{
index = 0;
}

// If searched entire table
if(indexStart == index)
{
return NULL;
}
}

return table[index];
}

In the above example the data stored in the table was the same as the key, so the return type is a pointer to the stored data in this instance. In order to support the deletion of entries you must mark a slot as deleted. This is because if a slot is simply marked as empty, then it can mess up the order of searches! Searching relies on the fact that collisions are resolved with linear probing. If you delete an entry that had previous collisions, the entries next to it will not be found in searches. However if you mark slots as "deleted" with a special value, than you can modify searching to not stop on "deleted" slots, and you can modify insertion to insert values into slots that are marked "deleted". You can see in the above code that searches hop over deleted slots, but stop at empty ones.

Now lets talk about this hash function. Creating hash functions seems very difficult, but luckily for around 50 or so years research has been put into them, and as such there lots of well documented hash functions and hash libraries all over the place. Here's the one I chose to use in my demonstration program:

// Hashing function from Program 14.2 in the Sedgewick book
unsigned UHashMod( const char *Key, unsigned TableSize )
{
unsigned hash = 0; // Initial value of hash
unsigned rand1 = 31415; // "Random" 1
unsigned rand2 = 27183; // "Random" 2

while (*Key)
{
hash = hash * rand1; // Multiply hash by random
hash = (hash + *Key); // Add in current char, keep within TableSize
rand1 = (rand1 * rand2); // Update rand1 for next "random" number
Key++;
}

return hash % TableSize;
}

The hash function is simply converting the string into a random (yet consistent) interpretation as an integer. This integer is then modulo'd with the TableSize variable, which is the size of the table to be inserted into to ensure that it is placed randomly within the bounds of the table.

Here's the output of the sample hash table program I wrote. It creates a table with 157 slots (more on why I chose 157 later -hint: it's prime), and then reads a text file line by line and inserts each individual line into the table with a hash function. I then run a demonstration with a hash lookup function. I created support for deletion of entries, and the program gracefully handles the entire table being full (this is easily detected when a search goes through the array and ends up where it started by wrapping from the last element to the first):

The text file I used is provided in the source. Note that duplicate entries were handled by taking no action. I felt that there wasn't a need to hold duplicates within a hash table, and simply just reported that duplicates were found. Here are some final notes I took from one of my CS courses:

Hash tables rely on the fact that the data is uniformly and randomly distributed.
Since we cannot control the data that is provided (from the user), we must ensure that it is randomly distributed by hashing it.
Hash tables whose size is a prime number help to insure good distribution. Also, table sizes that are a power of 2 work well if the stride (second hash in double hashing) is an odd number.
Since the data is not sorted in any meaningful way, operations such as displaying all items in a sorted fashion are expensive. (Use another data structure for that operation.)

There are many different types of collision resolution for hash tables. Using chaining (linked lists) seems to be a great way to store lots of data. The chains can even be ordered to allow for optimized searching (though this is unlikely to be worth the trouble). However for a small amount of keys and a small table, linear probing should work perfectly fine. There are also other probing methods, such as quadratic probing, double hashing, pseudo random probing (seeding with the key). If probing is the collision method chosen, then support for growing the hash table like a vector may need to be implemented, depending on the requirements of the table. Chaining avoids the growing issue altogether.

I will be using the code constructed here for my next large project. It will actually be solving the hypothetical issue described at the beginning of the post! This will be very handy, as nobody will have to manually keep track of any files anymore; I'll have code traverse a single directory full of assets and then automatically add them as entries to a hash table. Lookups will be highly optimized (which is important, lookups are highly common in a game project).

Additional resources:

Randy Gaul's Game Programming Blog

Sunday, July 1, 2012

Hash Tables: Introduction

No comments:

Post a Comment

Website has Moved!

Blog Archive

Labels

Create Console Game: Series of Articles

Search This Blog

Contact, Suggestions