Practical Survey on Hash Tables Aurelian Țuțuianu
In memoriam Mihai Pătrașcu (17 July 1982 - 5 June 2012). "I have no intention to ever teach computer science. I want to teach the love for computer science, and let the learning happen." Teaching Statement (http://people.csail.mit.edu/mip/docs/job-application07/statements.pdf)
Abstract: hash table definition; collision resolution schemes (chained hashing, linear and quadratic probing, cuckoo hashing); some hash function theory; simple tabulation hashing.
Omnipresence of hash tables: symbol tables in compilers; cache implementations; database storage; memory page management in Linux; routing tables; large collections of documents.
Hash Tables: Consider a set of elements S drawn from a finite and much larger universe U. A hash table consists of: a hash function h: U -> {0, ..., m-1}; a vector v of size m.
Collisions: the same hash for two different keys. [Figure: sample keys 26.17.41.60, 126.15.12.154, 202.223.224.33, 7.239.203.66, 176.136.103.233 hashed into the table] What to do? Ignore them; chain colliding values; skip and try again; hash and displace; find a perfect hash function.
War Story: cache with hash tables application. Problem: an application that gets some data from an expensive repository. Solution: a hash table with collision replacement. [Figure: hash table in front of the data source] Key point: a big chunk of users watched a lot of common data.
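One way to read "collision replacement" is a fixed-size table in which a colliding key simply overwrites the previous occupant of its slot. The sketch below illustrates that reading only; the class name `ReplacementCache` and the `loader` callback are hypothetical and not taken from the original application.

```python
class ReplacementCache:
    """Fixed-size cache: a colliding key overwrites whatever is in its slot."""

    def __init__(self, size, loader):
        self.slots = [None] * size    # each slot holds (key, value) or None
        self.loader = loader          # hypothetical expensive fetch function
        self.misses = 0

    def get(self, key):
        i = hash(key) % len(self.slots)
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            return entry[1]           # hit: no trip to the data source
        self.misses += 1
        value = self.loader(key)      # miss: ask the expensive repository
        self.slots[i] = (key, value)  # replace the previous occupant, if any
        return value
```

Memory use stays bounded and there is no eviction bookkeeping; the price is that two hot keys sharing a slot keep evicting each other.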
Collision Resolution Schemes: chained hashing; open addressing (linear and quadratic probing); cuckoo hashing; and many others: perfect hashing, coalesced hashing, Robin Hood hashing, hopscotch hashing, etc.
Chained Hashing: Each slot contains a linked list. O(n/m) = O(1) for all operations. Load factor: n/m < 1. [Figure: slots 0 to 6 with keys x, y, z, w chained in their slots] Properties: easy to implement; works with weak hash functions; consumes significant memory; default implementation.
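A minimal sketch of chained hashing follows. It makes two simplifying assumptions that are not in the slides: each chain is a Python list rather than a linked list, and the table never resizes, so keeping n/m small is left to the caller. The class name `ChainedHashTable` is illustrative.

```python
class ChainedHashTable:
    """Chained hashing: slot i holds every (key, value) pair hashing to i."""

    def __init__(self, m=8):
        self.buckets = [[] for _ in range(m)]
        self.n = 0                            # number of stored keys

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)      # overwrite an existing key
                return
        bucket.append((key, value))
        self.n += 1

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def remove(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.n -= 1
                return True
        return False
```

Every operation scans a single chain, which is where the O(n/m) cost comes from.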
Linear and quadratic probing: All records are stored in the bucket array itself. Probe: a try to find an empty place. Linear probing: h(x, i) = h_0(x) + i. Quadratic probing: h(x, i) = h_0(x) + (i + i^2)/2. [Figure: slots 0 to 6 holding w, y, z, x, with the probe sequence h(x, i) = 4 + i]
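The two probe sequences can be sketched as below. Indices are reduced modulo m, which the slide formulas leave implicit. The sketch assumes that `None` marks an empty slot (so `None` cannot be a key), that there are no deletions, and, for the quadratic variant, that m is a power of two so the triangular steps visit every slot. The names `probe_sequence` and `ProbingHashSet` are illustrative.

```python
def probe_sequence(h0, m, quadratic=False):
    """Yield the slots probed for a key whose home slot is h0."""
    for i in range(m):
        step = (i + i * i) // 2 if quadratic else i
        yield (h0 + step) % m


class ProbingHashSet:
    """Open addressing: keys live directly in the slot array."""

    def __init__(self, m=8, quadratic=False):
        self.m = m
        self.quadratic = quadratic
        self.slots = [None] * m               # None marks an empty slot

    def add(self, key):
        for s in probe_sequence(hash(key) % self.m, self.m, self.quadratic):
            if self.slots[s] is None or self.slots[s] == key:
                self.slots[s] = key
                return s                      # slot where the key ended up
        raise RuntimeError("table is full")

    def __contains__(self, key):
        for s in probe_sequence(hash(key) % self.m, self.m, self.quadratic):
            if self.slots[s] is None:
                return False                  # an empty slot ends the search
            if self.slots[s] == key:
                return True
        return False
```

With `h0 = 4` and `m = 7`, the linear sequence is 4, 5, 6, 0, 1, 2, 3, matching the h(x, i) = 4 + i example.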
War Story: Linear probing trick
                  Min.  1st Qu.  Median   Mean  3rd Qu.   Max.
linear probing       1     1947    3861   3925     5867   8070
chained hashing      1     8983   18370  21150    35600  50920
War Story: Let it be quadratic! Replace the library implementation with a home-made hash table: 4 hours of work.
Cuckoo hashing: Two hash tables, T_1 and T_2, each of size m, and two hash functions h_1, h_2: U -> {0, ..., m-1}. Value x is stored in cell h_1(x) of T_1 or in cell h_2(x) of T_2. Hash and displace. Lookup is constant in the worst case! Updates take constant amortized time. [Figure: tables T_1 and T_2 holding x, y, z, w, with arrows h_1(x) and h_2(x)]
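A minimal sketch of the hash-and-displace idea follows. Several details are assumptions rather than part of the slides: the two hash functions are obtained by salting Python's built-in `hash`, an insertion gives up after `max_kicks` displacements, and a failed insertion doubles the tables and draws new salts. `None` marks an empty cell, so it cannot be stored as a key.

```python
import random


class CuckooHashSet:
    """Each key lives in tables[0][h_1(key)] or in tables[1][h_2(key)]."""

    def __init__(self, m=16, max_kicks=32, seed=0):
        self.m = m
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * m, [None] * m]
        self._new_functions()

    def _new_functions(self):
        # Assumption: two salted variants of hash() stand in for h_1 and h_2.
        self.salts = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _h(self, t, key):
        return hash((self.salts[t], key)) % self.m

    def __contains__(self, key):
        # At most two probes, whatever the load: worst-case constant lookup.
        return any(self.tables[t][self._h(t, key)] == key for t in (0, 1))

    def add(self, key):
        if key in self:
            return
        t = 0
        for _ in range(self.max_kicks):
            i = self._h(t, key)
            # Place the key and pick up whatever occupied the cell.
            key, self.tables[t][i] = self.tables[t][i], key
            if key is None:
                return                 # the cell was empty: done
            t = 1 - t                  # the evicted key tries the other table
        self._rehash(key)

    def _rehash(self, pending):
        keys = [k for tab in self.tables for k in tab if k is not None]
        keys.append(pending)
        self.m *= 2
        self.tables = [[None] * self.m, [None] * self.m]
        self._new_functions()
        for k in keys:
            self.add(k)
```

The lookup never touches more than two cells; all the work of resolving collisions is pushed into `add`.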
What about hash functions? Is any hash function good? What does a good hash function mean? Can I have my own?
The beginning of time: Introduced by Arnold Dumey in 1956 for the symbol table in a compiler. He used a crazy, chaotic, random function h: U -> {0, ..., m-1}: h(x) = (x mod p) mod m, with p a big prime number. It seems to work, but why?
First station: rigorous analysis Consider that h really is a random function! Knuth established a way to make a complete analysis, but based on a false assumption. No matter how long you stare at h(x)=(x mod p) mod m, it will not morph into a random function!
Next station: universality and k-independence. Wegman and Carter (1978): a family of hash functions. No need for a perfectly random hash function, only a universal one: for all x_1, x_2 ∈ S with x_1 ≠ x_2, Pr[h(x_1) = h(x_2)] ≤ 1/N. In generalized form, the k-independence model uses statistics to measure how much randomness a family of hash functions can produce!
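The collision bound can be checked empirically. The sketch below uses the classic family h(x) = ((a·x + b) mod p) mod m with random a and b, which is not given on the slide and is swapped in here as a well-known example of a universal family. It assumes the N in the bound is the number of slots, written m in the code; the function names are illustrative.

```python
import random

P = (1 << 61) - 1   # a Mersenne prime, larger than any key used here


def random_hash(m, rng):
    """Draw one function h(x) = ((a*x + b) mod P) mod m from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m


def collision_rate(x1, x2, m, trials=20000, seed=1):
    """Estimate Pr[h(x1) == h(x2)] over random choices of h."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = random_hash(m, rng)
        hits += h(x1) == h(x2)
    return hits / trials
```

For a fixed pair of distinct keys and m = 16, the estimate should land near 1/16. The probability is taken over the choice of the function, not over the keys.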
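How does it work? [Figure: random data and the key x feed a formula that outputs h(x)] Universal multiplicative shift: h_a(x) = (a·x) >> (l - l_out), with arithmetic on l-bit words. 2-independent multiplicative shift: h_{a,b}(x) = (a·x + b) >> (2l - l_out), with arithmetic on 2l-bit words. k-independent polynomial hashing: h(x) = ((Σ_{i=0}^{k-1} a_i x^i) mod p) mod 2^{l_out}.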
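The three schemes can be sketched in a few lines each. Python integers are unbounded, so the word-size truncation that fixed-width arithmetic performs implicitly is written out with a mask. The function names are illustrative, and the parameter conventions (a odd for the universal scheme, a and b drawn from 2l-bit integers for the 2-independent one, coefficients drawn from [0, p)) follow the usual statements of these schemes rather than the slide.

```python
def multiply_shift(a, l, l_out):
    """Universal scheme: a is a random odd l-bit integer, keys are l-bit."""
    mask = (1 << l) - 1
    return lambda x: ((a * x) & mask) >> (l - l_out)


def multiply_add_shift(a, b, l, l_out):
    """2-independent scheme: a, b are random 2l-bit integers, keys are l-bit."""
    mask = (1 << (2 * l)) - 1
    return lambda x: ((a * x + b) & mask) >> (2 * l - l_out)


def polynomial(coeffs, p, l_out):
    """Polynomial hashing with k = len(coeffs) coefficients a_0 .. a_{k-1}."""
    def h(x):
        acc = 0
        for a in reversed(coeffs):      # Horner's rule
            acc = (acc * x + a) % p
        return acc % (1 << l_out)
    return h
```

In every case the random parameters are drawn once, when the table is created, and then reused for all keys.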
Facts on k-independence
Chained hashing
  1978 - Wegman, Carter: requires only universal hashing
Linear probing
  1990 - Siegel, Schmidt: O(log n)-independence is enough
  2007 - Pagh: 5-independence suffices
  2010 - Pătrașcu, Thorup: 4-independence is not enough
Cuckoo hashing
  2001 - Pagh: O(log n)-independence is enough
  2005 - Cohen, Kane: 5-independence is not enough
  2006 - Cohen, Kane: 6-independence is enough
Simple tabulation hashing: Simple tabulation is the fastest 3-independent family of hash functions known. A key x of length len (the bit width required to store values) is divided into c chars x_1, x_2, ..., x_c. We create c tables R_1, R_2, ..., R_c, filled with independent random values. The hash value is h(x) = R_1[x_1] ⊕ R_2[x_2] ⊕ ... ⊕ R_c[x_c]. [Figure: key x split into x_1, ..., x_4, each indexing one of 4 lookup tables with random 8-bit values, combined into h(x)]
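The construction above can be sketched as follows. The defaults are assumptions chosen for illustration: a 32-bit key split into c = 4 chars of 8 bits each, 32-bit table entries, and Python's `random` module as the source of the random values. The name `make_tabulation_hash` is hypothetical.

```python
import random


def make_tabulation_hash(c=4, char_bits=8, out_bits=32, seed=0):
    """Build h(x) = R_1[x_1] xor ... xor R_c[x_c] over freshly filled tables."""
    rng = random.Random(seed)
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        result = 0
        for i in range(c):
            char = (x >> (i * char_bits)) & mask   # i-th char, lowest first
            result ^= tables[i][char]
        return result

    return h, tables
```

Hashing a key costs c table lookups and c XORs, and with 8-bit chars each table has only 256 entries.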
The power of simple tabulation! "The Power of Simple Tabulation Hashing", Mihai Pătrașcu, Mikkel Thorup, December 6, 2011. According to this paper, even though simple tabulation is only 3-independent, we have: constant time for linear probing; constant time for static cuckoo hashing. => There are other probabilistic properties that can be exploited, beyond those captured by k-independence theory.
Summary: easy ways to implement optimal hash tables; a simple scheme to generate a hash function family; theory produces practical results and is still alive! There are a lot of occasions to apply these ideas, so: Work hard, have fun and make history!
Questions?