Practical Survey on Hash Tables. Aurelian Țuțuianu




In memoriam Mihai Pătrașcu (17 July 1982 - 5 June 2012)
"I have no intention to ever teach computer science. I want to teach the love for computer science, and let the learning happen."
Teaching Statement (http://people.csail.mit.edu/mip/docs/job-application07/statements.pdf)

Abstract
Hash table definition
Collision resolving schemes: chained hashing, linear and quadratic probing, cuckoo hashing
Some hash function theory
Simple tabulation hashing

Omnipresence of hash tables
Symbol tables in compilers
Cache implementations
Database storage
Memory page management in Linux
Routing tables
Large collections of documents

Hash Tables
Consider a set of elements S drawn from a finite and much larger universe U. A hash table consists of:
a hash function $h: U \to \{0, \dots, m-1\}$
a vector v of size m

Collisions
Example keys: 26.17.41.60, 126.15.12.154, 202.223.224.33, 7.239.203.66, 176.136.103.233
A collision is the same hash for two different keys. What to do?
Ignore them
Chain colliding values
Skip and try again
Hash and displace
Find a perfect hash function

War Story: cache with hash tables
Problem: an application fetches data from an expensive repository.
[Figure: hash table and data source]
Solution: a hash table with collision replacement.
Key point: a large share of users requested the same common data.
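
One way to read "hash table with collision replacement" is a fixed-size table in which a colliding key simply overwrites the previous occupant of its slot. The sketch below is an illustration under that assumption, not the implementation from the war story; `ReplacementCache` and the `load` callback (standing in for the expensive repository) are hypothetical names.

```python
class ReplacementCache:
    """Fixed-size cache: one entry per slot, a colliding key overwrites the old entry."""

    def __init__(self, size, load):
        self.slots = [None] * size   # each slot holds (key, value) or None
        self.load = load             # hypothetical callback that queries the expensive repository
        self.misses = 0

    def get(self, key):
        i = hash(key) % len(self.slots)
        entry = self.slots[i]
        if entry is not None and entry[0] == key:
            return entry[1]          # hit: no repository access
        self.misses += 1
        value = self.load(key)       # miss: fetch from the repository
        self.slots[i] = (key, value) # replace whatever was stored in this slot
        return value
```

There are no chains and no probing, so memory use is bounded and every lookup touches exactly one slot. The price is that two popular keys sharing a slot keep evicting each other.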

Collision Resolution Schemes
Chained hashing
Open addressing: linear and quadratic probing
Cuckoo hashing
Many others: perfect hashing, coalesced hashing, Robin Hood hashing, hopscotch hashing, etc.

Chained Hashing
Each slot contains a linked list.
$O(n/m) = O(1)$ for all operations.
Load factor: $n/m < 1$.
[Figure: slots 0 to 6, with the keys x, y, z, w stored in chains]
Easy to implement
Works with weak hash functions
Consumes significant memory
Default implementation
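
A minimal sketch of chained hashing follows. It assumes Python's built-in `hash` as the hash function and uses Python lists in place of linked lists; the class name `ChainedHashTable` and its methods are illustrative, not taken from the slides.

```python
class ChainedHashTable:
    """Hash table where each of the m slots holds a chain of (key, value) pairs."""

    def __init__(self, m=8):
        self.buckets = [[] for _ in range(m)]
        self.n = 0                      # number of stored keys

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.n += 1

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.n -= 1
                return True
        return False
```

Each operation scans a single chain, whose average length is the load factor n/m. The sketch never resizes, so keeping n/m bounded is left to the caller.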

Linear and quadratic probing
All records are stored in the bucket array itself.
A probe is an attempt to find an empty slot.
[Figure: slots 0 to 6 holding the keys w, y, z, x, probed with $h(x, i) = 4 + i$]
Linear probing: $h(x, i) = h_0(x) + i$
Quadratic probing: $h(x, i) = h_0(x) + \frac{i + i^2}{2}$
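
The two probe sequences can be sketched as follows. Reducing each index modulo the table size m is an assumption added here so that probes wrap around; the function names are illustrative.

```python
def linear_probe(h0, i, m):
    """i-th slot tried by linear probing: h0, h0 + 1, h0 + 2, ..."""
    return (h0 + i) % m


def quadratic_probe(h0, i, m):
    """i-th slot tried by quadratic probing with offsets (i + i^2) / 2."""
    return (h0 + (i + i * i) // 2) % m


def insert(table, key, probe):
    """Store key in the first free slot along its probe sequence and return that slot."""
    m = len(table)
    h0 = hash(key) % m
    for i in range(m):
        slot = probe(h0, i, m)
        if table[slot] is None or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found in m probes")
```

With offsets (i + i^2)/2 and m a power of two, the first m probes visit every slot exactly once, which is one reason this quadratic variant is a convenient choice.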

War Story: Linear probing trick
                 Min.  1st Qu.  Median   Mean  3rd Qu.   Max.
linear probing      1     1947    3861   3925     5867   8070
chained hashing     1     8983   18370  21150    35600  50920

War Story: Let it be quadratic!
Replace the library implementation with a home-made hash table.
4 hours of work.

Cuckoo hashing
Two hash tables, $T_1$ and $T_2$, each of size m, and two hash functions $h_1, h_2: U \to \{0, \dots, m-1\}$.
Value x is stored in cell $h_1(x)$ of $T_1$ or in cell $h_2(x)$ of $T_2$.
[Figure: tables $T_1$ and $T_2$ holding the keys x, y, z, w, with arrows for $h_1(x)$ and $h_2(x)$]
Hash and displace.
Lookup is constant in the worst case!
Updates take constant amortized time.
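
The hash-and-displace idea can be sketched as below. Several details are assumptions made for illustration only: the two hash functions are simulated by salting Python's built-in `hash`, a full rebuild with doubled tables is triggered after `max_kicks` displacements, and `None` cannot be stored as a key.

```python
import random


class CuckooHashTable:
    """Set of keys stored in two tables; each key lives in one of two candidate cells."""

    def __init__(self, m=16, max_kicks=32, seed=0):
        self.m = m
        self.max_kicks = max_kicks
        self.rng = random.Random(seed)
        self.tables = [[None] * m, [None] * m]
        self._new_hash_functions()

    def _new_hash_functions(self):
        # Stand-in for drawing h1 and h2 from a hash family: two random salts.
        self.salts = [self.rng.getrandbits(64), self.rng.getrandbits(64)]

    def _h(self, t, key):
        return hash((self.salts[t], key)) % self.m

    def contains(self, key):
        # Worst-case constant lookup: exactly two cells are inspected.
        return any(self.tables[t][self._h(t, key)] == key for t in (0, 1))

    def insert(self, key):
        if self.contains(key):
            return
        t = 0
        for _ in range(self.max_kicks):
            i = self._h(t, key)
            # Put key into its cell and pick up whatever was stored there.
            key, self.tables[t][i] = self.tables[t][i], key
            if key is None:
                return
            t = 1 - t                   # the displaced key tries the other table
        self._rehash(key)               # too many displacements: rebuild

    def _rehash(self, pending):
        keys = [k for tab in self.tables for k in tab if k is not None]
        keys.append(pending)
        self.m *= 2
        self.tables = [[None] * self.m, [None] * self.m]
        self._new_hash_functions()
        for k in keys:
            self.insert(k)
```

Lookups never follow chains or probe sequences. All the work is shifted to insertion, where a displaced key may in turn displace another.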

What about hash functions?
Is any hash function good?
What does a good hash function mean?
Can I write my own?

The beginning of time
Introduced by Arnold Dumey in 1956 for the symbol table in a compiler.
He used a crazy, chaotic, random function $h: U \to \{0, \dots, m-1\}$:
$h(x) = (x \bmod p) \bmod m$, with p a big prime number.
It seems to work, but why?

First station: rigorous analysis
Consider that h really is a random function!
Knuth established a way to make a complete analysis, but it rests on a false assumption.
No matter how long you stare at $h(x) = (x \bmod p) \bmod m$, it will not morph into a random function!

Next station: universality and k-independence
Wegman and Carter (1978): a family of hash functions.
There is no need for a perfectly random hash function, only for a universal family. With h chosen at random from the family and N the number of possible hash values:
$\forall x_1, x_2 \in S,\ x_1 \neq x_2:\ \Pr[h(x_1) = h(x_2)] \leq \frac{1}{N}$
In generalized form, the k-independence model uses statistics to measure how much randomness a family of hash functions can produce!
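
The universality condition can be checked exhaustively for a small family. The sketch below uses the classic family $h_{a,b}(z) = ((a z + b) \bmod p) \bmod m$ with $1 \leq a < p$ and $0 \leq b < p$. That family is not given on the slides and is chosen here only because it is small enough to enumerate; the values p = 17 and m = 4 are arbitrary.

```python
from fractions import Fraction
from itertools import combinations


def collision_probability(x, y, p, m):
    """Exact Pr[h(x) = h(y)] over all h_{a,b}(z) = ((a*z + b) mod p) mod m."""
    collisions = sum(
        ((a * x + b) % p) % m == ((a * y + b) % p) % m
        for a in range(1, p)
        for b in range(p)
    )
    return Fraction(collisions, (p - 1) * p)


p, m = 17, 4   # small prime and table size, chosen only for illustration
worst = max(
    collision_probability(x, y, p, m)
    for x, y in combinations(range(p), 2)
)
print(worst, "<=", Fraction(1, m))
```

The probability is taken over the random choice of the function, not over the keys: the bound holds for every fixed pair of distinct keys.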

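How does it work?
[Figure: random data and the key x are fed into a formula that outputs h(x)]
Universal multiplicative shift: $h_a(x) = (a \cdot x) \gg (l - l_{out})$
2-independent multiplicative shift: $h_{a,b}(x) = (a \cdot x + b) \gg (2l - l_{out})$
k-independent polynomial hashing: $h(x) = \left( \left( \sum_{i=0}^{k-1} a_i x^i \right) \bmod p \right) \bmod 2^{l_{out}}$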
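
The three schemes can be sketched in a few lines. The parameter conventions are assumptions based on the usual presentation of these constructions, not stated on the slide: l is the key width in bits, l_out the hash width, a is a random odd l-bit integer for the first scheme, a and b are random 2l-bit integers for the second, and the coefficients a_i are random values below a prime p larger than any key.

```python
def multiply_shift(a, x, l, l_out):
    """Universal scheme: multiply modulo 2^l, then keep the top l_out bits."""
    return ((a * x) & ((1 << l) - 1)) >> (l - l_out)


def multiply_add_shift(a, b, x, l, l_out):
    """2-independent scheme: compute a*x + b modulo 2^(2l), keep the top l_out bits."""
    return ((a * x + b) & ((1 << (2 * l)) - 1)) >> (2 * l - l_out)


def polynomial_hash(coeffs, x, p, l_out):
    """k-independent scheme: degree k-1 polynomial with coeffs = [a_0, ..., a_{k-1}]."""
    acc = 0
    for a in reversed(coeffs):      # Horner's rule, reducing modulo p at each step
        acc = (acc * x + a) % p
    return acc % (1 << l_out)
```

The explicit masks emulate fixed-width overflow, which Python integers lack. In C the first scheme reduces to one multiplication and one shift on an unsigned l-bit type.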

Facts on k-independence
Chained hashing
  1978, Wegman, Carter: requires only universal hashing
Linear probing
  1990, Siegel, Schmidt: O(log n)-independence is enough
  2007, Pagh: 5-independence suffices
  2010, Pătrașcu, Thorup: 4-independence is not enough
Cuckoo hashing
  2001, Pagh: O(log n)-independence is enough
  2005, Cohen, Kane: 5-independence is not enough
  2006, Cohen, Kane: 6-independence is enough

Simple tabulation hashing
Simple tabulation is the fastest 3-independent family of hash functions known.
A key x of length len (the bit width required to store its values) is divided into c characters $x_1, x_2, \dots, x_c$.
We create c tables $R_1, R_2, \dots, R_c$, filled with independent random values.
The hash value is computed as $h(x) = R_1[x_1] \oplus R_2[x_2] \oplus \dots \oplus R_c[x_c]$
[Figure: key x split into $x_1, \dots, x_4$, four lookup tables $R_1, \dots, R_4$ with random 8-bit values, combined into h(x)]
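
A minimal sketch of simple tabulation, matching the figure's setting of four lookup tables with random 8-bit values. Taking the first character as the least significant byte of the key is an assumption, and `make_tabulation_hash` is an illustrative name.

```python
import random


def make_tabulation_hash(c=4, char_bits=8, out_bits=8, seed=None):
    """Build h(x) = R_1[x_1] xor ... xor R_c[x_c] for keys of c * char_bits bits."""
    rng = random.Random(seed)
    # One table per character position, each with 2^char_bits random out_bits-bit values.
    tables = [
        [rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
        for _ in range(c)
    ]
    mask = (1 << char_bits) - 1

    def h(x):
        result = 0
        for j in range(c):
            char = (x >> (j * char_bits)) & mask   # j-th character, least significant first
            result ^= tables[j][char]
        return result

    return h, tables


h, R = make_tabulation_hash(seed=1)
print(h(0x12345678))
```

The tables hold only c * 2^char_bits entries (1024 here), so they fit in fast cache and each hash costs c lookups and c XORs.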

The power of simple tabulation!
"The Power of Simple Tabulation Hashing", Mihai Pătrașcu, Mikkel Thorup, December 6, 2011
According to this paper, even though simple tabulation is only 3-independent, we have:
Constant time for linear probing
Constant time for static cuckoo hashing
=> There are other probabilistic properties that can be exploited, beyond the ones captured by k-independence theory.

Summary
Easy ways to implement optimal hash tables
A simple scheme to generate a hash function family
Theory produces practical results and is still alive!
There are a lot of occasions to apply these ideas, so:
Work hard, have fun and make history!

Questions?