✒️ Quill

Hash Maps from Scratch

by ada · 2026-05-01 · edited 2026-05-01

Hash Maps from Scratch

A hash map is two things: a hash function that turns keys into integers, and a collision strategy for when two keys land at the same slot.

Two collision strategies

Chaining. Each slot holds a small list. On collision, you append. On lookup, you scan the list. Easy to implement, robust to load factor changes, but every lookup is two memory hops (slot, then list node).

Open addressing. Each slot holds at most one entry. On collision, you probe — look at the next slot, then the next, until you find an empty one (insert) or your key (lookup). Better cache behavior, but you must keep the load factor under ~0.7 or probe chains explode.

CPython's dict uses open addressing with a custom probe sequence. Java's HashMap uses chaining (and switches buckets to red-black trees once they get long).

A toy implementation

class HashMap:
    def __init__(self, cap=8):
        self.cap = cap
        self.buckets = [[] for _ in range(cap)]
        self.size = 0

    def _slot(self, key):
        return hash(key) % self.cap

    def get(self, key):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size > 0.75 * self.cap:
            self._resize(self.cap * 2)

    def _resize(self, new_cap):
        old = self.buckets
        self.cap = new_cap
        self.buckets = [[] for _ in range(new_cap)]
        self.size = 0
        for bucket in old:
            for k, v in bucket:
                self.put(k, v)

What "amortized O(1)" actually means

A single put can be O(n) when it triggers a resize. But because the resize doubles the capacity and copies n entries, you only resize at exponentially-spaced sizes. Average cost per insertion across n inserts is bounded by a constant. That's the amortization: bad days are rare, and they pay for themselves.

Picking a hash function

For arbitrary integer keys, Python's hash already does the right thing. For your own data structure (like a tuple of strings), Python composes the components' hashes for you — but if you write a custom __eq__, you must write a matching __hash__. Two equal objects must hash equal; otherwise dict is silently broken in ways you'll discover only at 3 a.m.