Sunday, July 1, 2012

Hash Tables: Introduction

I've just set up my first hash table after a bit of studying; it's time to write about it! Docendo discimus ~

Wikipedia has a nice definition of what a hash table is: LINK.
In computer science, a hash table or hash map is a data structure that uses a hash function to map identifying values, known as keys (e.g., a person's name), to their associated values (e.g., their telephone number). Thus, a hash table implements an associative array. The hash function is used to transform the key into the index (the hash) of an array element (the slot or bucket) where the corresponding value is to be sought.

The idea of setting up a hash table is that by using a hash function you can index an array without actually knowing what the index is.


For example say you have an array of structures that look like this:

typedef struct _ASSETFILE
{

  char *fileName_;
  assetFile *asset_;
} ASSETFILE;

This array holds structures that represent asset files for some sort of project (images, sounds, data, etc). Imagine you want anyone to be able to access any asset file in this array by using a function like so: FindAsset( "filename" );. How would you go about creating this FindAsset function? The easiest way would be to use a linear search starting at the beginning of the array. Simply compare the string of the parameter passed to the FindAsset function with the string within the ASSETFILE structure until a match is found.


This can be a slow search. Perhaps the next optimization would be to order the entire list alphabetically from lowest to highest. This would allow a binary search through the array by using strcmp from the standard library. There wouldn't be any faster search method than a binary search in this scenario. Inserting into an ordered list would require a binary search as well.


However by using a hash function you can achieve even faster searches. By using a hash function you can convert the filename provided to the FindAsset function directly into an index to index the array of ASSETFILE structs. The filename is what is called a key. By hashing a key an index to the hash table is generated, and the corresponding ASSETFILE struct can be accessed very quickly. It's important to note that the order in which the entries are stored into slots is random. A good hash function will randomly map keys to indices with an even distribution (avoids clustering).

The relationship between keys and indices is not a one to one ratio however. Keys to indices is of a one-to-many ratio -that is multiple keys can resolve to the same address. Each key must be unique, however, in order for the hash table to function properly.


In short, using a hash table allows for indexing of an array by some other type of data than simply an integer. There are many different hash functions available on the internet, and a simple google search will yield many results.


I've set up a demonstration program to show the usage of a hash function here. I handle collisions with something called linear probing. A collision is whenever a key maps to a location in the table that already contains an entry (two or more keys are resolving to the same slot). A collision can be resolved in many different ways, though the way I chose in my sample program was linear probing. Linear probing is simply moving down one slot in the table until a free slot is found. The entry to insert is then placed. Each time the index is incremented (or decremented, direction doesn't matter as long as your are consistent) to find an empty slot it is called a probe. This means that for each lookup you'll need to compare the key to the entry being looked at and check for a match, as when you get a hashed index it doesn't necessarily mean that the key you have had been placed into that entry; a previous key may have inserted a value there already. When attempting a lookup first check the slot the hash resolves to. If no match is found, check the next slot (using the same direction as before) until you find a key match, or an empty slot. If you find an empty slot then that means your lookup concluded that no entry matches in your table.


Linear probing forms clusters of entries, however, since you resolve collision by placing entries into the next available slot. Each entry placed into the table increases the chances of a collision, and as the table fills up a lot of time ends up being spent on probing looking for an empty slot. There are ways of breaking up these clusters like double hashing, or by having each slot point to a linked list of entries. However with a good hash function clusters can be kept to a minimum as long as the table doesn't get too full.



Load factor is a term that represents the total number of current entries divided by the table size. Once a hash table has a load factor of .7 or so linear probing starts getting dramatically slow.


Image from Wikipedia. As you can see once the load factor of the table hits around .8 the time spent probing increases dramatically.


In order to retrieve an entry from the hash table (with linear probing, as in my sample program) all you'd have to do is take your key and pass it to your hash function. Once this is done you'll have the index to start your search. Check to see if the key matches the key within the index. If it does not match then increment (or decrement, whatever direction you probe in you must do here for consistency) the index by one and compare again. Once a match is found the lookup function has done its job! If no match is found and the search function runs into an empty slot, that means that no match had been found. Here's the lookup function from the sample program:

char *TableLookup( const char *Key, char *table[] )
{
  unsigned index = UHashMod( Key, TABLESIZE );
  unsigned indexStart = index;

  // Only search through clusters, don't hop over empty entries
  while(!IsEmpty( table, index ))
  {
    // Break from loop if found matching entry
    if(table[index] != DELETED)
    {
      if(strcmp(table[index], Key) == 0)
        break;
    }

    ++index;

    // wraparound to beginning of table
    if(index == TABLESIZE)
    {
      index = 0;
    }

    // If searched entire table
    if(indexStart == index)
    {
      return NULL;
    }
  }

  return table[index];
}

In the above example the data stored in the table was the same as the key, so the return type is a pointer to the stored data in this instance. In order to support the deletion of entries you must mark a slot as deleted. This is because if a slot is simply marked as empty, then it can mess up the order of searches! Searching relies on the fact that collisions are resolved with linear probing. If you delete an entry that had previous collisions, the entries next to it will not be found in searches. However if you mark slots as "deleted" with a special value, than you can modify searching to not stop on "deleted" slots, and you can modify insertion to insert values into slots that are marked "deleted". You can see in the above code that searches hop over deleted slots, but stop at empty ones.


Now lets talk about this hash function. Creating hash functions seems very difficult, but luckily for around 50 or so years research has been put into them, and as such there lots of well documented hash functions and hash libraries all over the place. Here's the one I chose to use in my demonstration program:

// Hashing function from Program 14.2 in the Sedgewick book
unsigned UHashMod( const char *Key, unsigned TableSize )
{
  unsigned hash = 0;      // Initial value of hash
  unsigned rand1 = 31415; // "Random" 1
  unsigned rand2 = 27183; // "Random" 2

  while (*Key)
  {
    hash = hash * rand1;     // Multiply hash by random
    hash = (hash + *Key);    // Add in current char, keep within TableSize
    rand1 = (rand1 * rand2); // Update rand1 for next "random" number
    Key++;
  }

  return hash % TableSize;
}

The hash function is simply converting the string into a random (yet consistent) interpretation as an integer. This integer is then modulo'd with the TableSize variable, which is the size of the table to be inserted into to ensure that it is placed randomly within the bounds of the table.


Here's the output of the sample hash table program I wrote. It creates a table with 157 slots (more on why I chose 157 later -hint: it's prime), and then reads a text file line by line and inserts each individual line into the table with a hash function. I then run a demonstration with a hash lookup function. I created support for deletion of entries, and the program gracefully handles the entire table being full (this is easily detected when a search goes through the array and ends up where it started by wrapping from the last element to the first):

Dump table contents (length 157 char pointers):
      * NULL
      * NULL
      * NULL
        Lexicon
        Tiger Express
        Tiny Tim
        Beer cans
        Bottle
        I LEIK TOITLES
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
        Teeter Totter
      * NULL
      * NULL
        test2
      * NULL
      * NULL
        das
        Snickerdoodle
        So many probes >.<
      * NULL
      * NULL
      * NULL
      * NULL
        test
      * NULL
        Baha Blast
      * NULL
        WTF
      * NULL
        SC Original
      * NULL
      * NULL
      * NULL
      * NULL
        Schism
      * NULL
      * NULL
      * NULL
      * NULL
        asd
        Tylanol
      * NULL
      * NULL
      * NULL
      * NULL
        Contraceptor
      * NULL
  --->  DELETED
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
        Dignity
        popo
        BrooD WaR
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
        Quadruple
      * NULL
        Typo
      * NULL
      * NULL
        alphabet
        StarCraft 2
        Rover Dover
        Extra power
        phantom
      * NULL
      * NULL
      * NULL
        Mountaineous
      * NULL
        skdjf
        a
        b
      * NULL
        Carbon
        This is a string!
      * NULL
      * NULL
      * NULL
        Alexis
      * NULL
      * NULL
        DIuasplOis
        Duplicate
        Contraption
      * NULL
      * NULL
      * NULL
        Beasty Boys
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
      * NULL
        Turdy Nerdy
        Meek
      * NULL
        Plasma
      * NULL
        asdf
        Turtles
        Flayva Flave
        Bumble Bee
      * NULL
      * NULL
      * NULL
      * NULL
        OMG
      * NULL
      * NULL
      * NULL
        tab
      * NULL
        LOL
        tad
      * NULL
        <3
        Lionheart
      * NULL
        Nonchalant
        Supercalifragilisticexpialladocious
      * NULL
      * NULL
        I lover yous
      * NULL
      * NULL
        Digestion
        Avalanche

Table lookup demonstration: <key index>
Lexicon 3 | Tiger Express 3 | Tiny Tim 5 | Beer cans 3 | Bottle 5 | I LEIK TOITL
ES 6 | Teeter Totter 31 | test2 34 | das 37 | Snickerdoodle 38 | So many probes
>.< 39 | test 44 | Baha Blast 46 | WTF 48 | SC Original 50 | Schism 55 | asd 60
| Tylanol 60 | Contraceptor 66 | Dignity 74 | popo 75 | BrooD WaR 76 | Quadruple
 82 | Typo 84 | alphabet 87 | StarCraft 2 88 | Rover Dover 88 | Extra power 90 |
 phantom 91 | Mountaineous 95 | skdjf 97 | a 97 | b 98 | Carbon 101 | This is a
string! 102 | Alexis 106 | DIuasplOis 109 | Duplicate 110 | Contraption 111 | Be
asty Boys 115 | Turdy Nerdy 123 | Meek 123 | Plasma 126 | asdf 128 | Turtles 128
 | Flayva Flave 129 | Bumble Bee 129 | OMG 136 | tab 140 | LOL 142 | tad 142 | <
3 145 | Lionheart 146 | Nonchalant 148 | Supercalifragilisticexpialladocious 148
 | I lover yous 152 | Digestion 155 | Avalanche 156 |

Error count: 0
Probe count: 19
Entry Count: 58
Table size: 157
Duplicate entries: 2
Load factor: 0.369427

The text file I used is provided in the source. Note that duplicate entries were handled by taking no action. I felt that there wasn't a need to hold duplicates within a hash table, and simply just reported that duplicates were found. Here are some final notes I took from one of my CS courses:
  • Hash tables rely on the fact that the data is uniformly and randomly distributed.
  • Since we cannot control the data that is provided (from the user), we must ensure that it is randomly distributed by hashing it.
  • Hash tables whose size is a prime number help to insure good distribution. Also, table sizes that are a power of 2 work well if the stride (second hash in double hashing) is an odd number.
  • Since the data is not sorted in any meaningful way, operations such as displaying all items in a sorted fashion are expensive. (Use another data structure for that operation.)
There are many different types of collision resolution for hash tables. Using chaining (linked lists) seems to be a great way to store lots of data. The chains can even be ordered to allow for optimized searching (though this is unlikely to be worth the trouble). However for a small amount of keys and a small table, linear probing should work perfectly fine. There are also other probing methods, such as quadratic probing, double hashing, pseudo random probing (seeding with the key). If probing is the collision method chosen, then support for growing the hash table like a vector may need to be implemented, depending on the requirements of the table. Chaining avoids the growing issue altogether.


I will be using the code constructed here for my next large project. It will actually be solving the hypothetical issue described at the beginning of the post! This will be very handy, as nobody will have to manually keep track of any files anymore; I'll have code traverse a single directory full of assets and then automatically add them as entries to a hash table. Lookups will be highly optimized (which is important, lookups are highly common in a game project).


Additional resources:

No comments:

Post a Comment

Note: Only a member of this blog may post a comment.