How a hash table works and how to resolve hash collisions

Rediet Abere
5 min read · Mar 8, 2020


Image source: Google Images

What I know now

My first encounter with the word hash was in my second Computer Science class. It struck me as a vague word, full of complex concepts. At this moment, my instinct tells me I only understand hash tables at a surface level. I hope that by the end of this article I will have a crystal-clear understanding of the hash table and its implementation.

Definition

A hash table is a data structure that uses a key to store and retrieve its associated value; it implements what Wikipedia calls the associative array abstract data type. How does that work? Let’s first consider simple data with no duplicate hash codes, meaning that for every key we pass to the hash function, we get a unique hash code. That hash code is then used to calculate the index where the value will be stored.

What is a hash function?

The idea of a hash table revolves around the hash function, which takes a key as an argument and returns an integer representation of that key. We don’t stop there: we still need an index at which to store the key’s value. That is where the modulo operation comes in; taking the integer representation modulo the size of the array yields an index within the array’s bounds.
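The two steps above can be sketched in a few lines of Python. The character-sum hash and the table size of 8 are illustrative choices, not a real-world hash function:

```python
TABLE_SIZE = 8  # number of slots in the underlying array

def toy_hash(key):
    # A toy hash function: sum the character codes of the key
    # to get an integer representation.
    return sum(ord(ch) for ch in str(key))

def index_for(key):
    # Modulo folds the integer into the array's bounds.
    return toy_hash(key) % TABLE_SIZE

print(index_for("name"))  # 417 % 8 = 1
```

Whatever key we pass in, the resulting index always lands between 0 and TABLE_SIZE - 1, which is exactly what lets us use it as an array index.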

That’s pretty technical, so let’s reach for a metaphor to wrap our heads around the concept. In biology, a sperm cell must successfully pass through the female reproductive tract to meet the egg and produce a zygote with a new DNA sequence, which eventually becomes offspring unique from its siblings. Mapped back to our hash function, the sperm is the key, the egg is the hash function, and the zygote with its distinct DNA sequence is the hash code.

On rare occasions, two sperm cells are known to fertilize a single egg. In our metaphor, that is two different keys producing the same hash code.

When the hash function returns the same hash code for two different keys, a collision occurs. There are multiple collision resolution strategies. In this article, we will touch on linear probing (a type of open addressing), chaining, and dynamic resizing.

  1. Linear Probing: Sometimes we end up with a key whose value maps to an index that is already occupied. When that happens, we check the slots following that index until we find an empty one. This is not an ideal solution, because we might end up with clusters, meaning too many values bunched together in one region of the array. Clustering makes it time-consuming both to retrieve a value we stored and to find an empty slot for the next value assigned to an occupied index.

As an example, let’s say we want to store names and ages in our hash table. How do we deal with two entries that have different names but are assigned the same index? I have included an image of linear probing over simple data.

Linear probing for simple data
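A linear probe can be sketched as follows. This is my own minimal reconstruction, not the article’s original code; names like `probe_insert` are illustrative, and the fixed table of (key, value) pairs stands in for the name-and-age example:

```python
TABLE_SIZE = 8
table = [None] * TABLE_SIZE  # each slot holds a (key, value) pair or None

def probe_insert(key, value):
    index = hash(key) % TABLE_SIZE
    # Walk forward (wrapping around) until we find an empty slot,
    # or the key itself, in which case we update it in place.
    for step in range(TABLE_SIZE):
        slot = (index + step) % TABLE_SIZE
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return
    raise RuntimeError("hash table is full")

def probe_get(key):
    index = hash(key) % TABLE_SIZE
    # Retrace the same probe sequence used at insertion time.
    for step in range(TABLE_SIZE):
        slot = (index + step) % TABLE_SIZE
        if table[slot] is None:
            return None  # hit an empty slot: the key was never stored
        if table[slot][0] == key:
            return table[slot][1]
    return None

probe_insert("Abel", 24)
probe_insert("Sara", 31)
print(probe_get("Abel"))  # 24
```

Note that lookup must retrace the exact probe sequence used at insertion; that is also why long runs of occupied slots (clusters) slow everything down.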
  2. Chaining: this takes advantage of the structure of a linked list. If a collision occurs, instead of looking for the next empty slot, we chain the new item onto the items already stored at its assigned index. Chaining works best when each bucket holds only a few items (roughly three or fewer); otherwise, the time complexity grows toward O(n) as the bucket grows, which makes our data structure inefficient, since we are striving for O(1) lookup.
Chaining for simple data
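Chaining can be sketched like this. As before, this is a minimal illustration of my own, using Python lists in place of linked lists, which keeps the idea intact: colliding keys simply share a bucket.

```python
TABLE_SIZE = 8
buckets = [[] for _ in range(TABLE_SIZE)]  # one chain per slot

def chain_insert(key, value):
    bucket = buckets[hash(key) % TABLE_SIZE]
    for i, (existing_key, _) in enumerate(bucket):
        if existing_key == key:
            bucket[i] = (key, value)  # key already present: update it
            return
    bucket.append((key, value))       # collision or empty: extend the chain

def chain_get(key):
    # Scan only the one bucket the key hashes to.
    bucket = buckets[hash(key) % TABLE_SIZE]
    for existing_key, value in bucket:
        if existing_key == key:
            return value
    return None

chain_insert("Abel", 24)
chain_insert("Sara", 31)
print(chain_get("Sara"))  # 31
```

Lookup only scans one bucket, so it stays fast as long as the chains stay short, which is exactly the condition described above.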
  3. Dynamic Resizing: another way to deal with collisions is to track a metric called the load factor, which measures how many entries the buckets hold and how evenly they are distributed. If the load factor exceeds (or falls below) a given threshold, we resize our array to redistribute the entries.

For dynamic resizing, instead of an image, I will include code I wrote for class and walk through the process of writing it.

We need to calculate the load factor to keep track of the key-value entries in our hash table; it tells us how full the buckets are. Calculating the load factor is like calculating a mean. A mean is the sum of the data divided by the number of data points; the load factor uses the same idea, except it sums the number of entries in each bucket and divides by the total number of buckets.

load factor
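The original class code is not reproduced here, but the calculation just described can be sketched as follows, assuming the table is a list of buckets where each bucket is a list of (key, value) pairs:

```python
def load_factor(buckets):
    # Sum the entries across all buckets, then divide by the
    # number of buckets, just like computing a mean.
    total_entries = sum(len(bucket) for bucket in buckets)
    return total_entries / len(buckets)

# Three entries spread over three buckets gives a load factor of 1.0.
print(load_factor([[("a", 1)], [("b", 2), ("c", 3)], []]))  # 1.0
```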

Now that we can keep track of the load factor, we check it against the threshold and resize our array. Two actions take place in the resize function. First, we copy the entries from the old hash table into temporary storage; then we move what is in the temporary storage into the resized hash table.

resize
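Again standing in for the original class code, here is a sketch of those two steps under the same bucket-list assumption. Each entry must be re-hashed, because its index depends on the table size:

```python
def resize(buckets, new_size):
    # Step 1: copy every entry into temporary storage.
    entries = [pair for bucket in buckets for pair in bucket]
    # Step 2: rebuild at the new size and re-insert each entry,
    # re-hashing so it lands at its new index.
    new_buckets = [[] for _ in range(new_size)]
    for key, value in entries:
        new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets
```

The re-hashing step is the whole point of resizing: with more buckets, entries that previously collided spread out, pulling the load factor back under the threshold.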

Closing

I have come a long way through the process of reading and writing about hash tables. I hope you feel the same way too. I used Wikipedia as my source of truth and got inspiration from Vaidehi Joshi. I am interested in writing about the different applications of the hash function. Until my next article, ciao!

Resources

  1. Hashing out Hash Functions, Vaidehi Joshi
  2. Taking Hash Tables off The Shelf, Vaidehi Joshi
  3. Linear Probing, Wikipedia
  4. Hash Table, Wikipedia
  5. Primary Clustering, Wikipedia
