CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions. Linda Shapiro Spring 2016




Announcements. Friday: review list and go over answers to Practice Problems 2.

Hash Tables: Review. Aim for constant-time (i.e., O(1)) find, insert, and delete, on average, under some reasonable assumptions. A hash table is an array of some fixed size (but growable, as we'll see). [Diagram: the client passes a key E to the hash table library, which hashes it to an int table index in 0 .. TableSize - 1; when two keys map to the same index we have a collision and need collision resolution.]

Collision resolution. Collision: when two keys map to the same location in the hash table. We try to avoid it, but the number of keys can exceed the table size, so hash tables should support collision resolution. Ideas?

Separate Chaining. Chaining: all keys that map to the same table location are kept in a list (a.k.a. a chain or bucket). As easy as it sounds. Example: insert 10, 22, 107, 12, 42 with mod hashing and TableSize = 10. The inserts proceed as follows: 10 goes to bucket 0; 22 goes to bucket 2; 107 goes to bucket 7; 12 goes to the front of bucket 2; 42 goes to the front of bucket 2. Final state: bucket 0 holds 10; bucket 2 holds the chain 42, 12, 22; bucket 7 holds 107; all other buckets are empty.
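The chaining scheme above can be sketched in Python. This is a minimal illustration (not the course's reference implementation), using mod hashing and front-of-list insertion as in the example:

```python
# Minimal separate-chaining hash table, assuming integer keys and
# mod hashing as in the slides' example (TableSize = 10).

class ChainedHashTable:
    def __init__(self, table_size=10):
        self.table_size = table_size
        self.buckets = [[] for _ in range(table_size)]  # one chain per slot

    def insert(self, key):
        bucket = self.buckets[key % self.table_size]
        if key not in bucket:
            bucket.insert(0, key)  # add at the front of the chain

    def find(self, key):
        return key in self.buckets[key % self.table_size]

# Reproduce the slides' example: insert 10, 22, 107, 12, 42
t = ChainedHashTable()
for k in [10, 22, 107, 12, 42]:
    t.insert(k)
print(t.buckets[2])  # [42, 12, 22]: all three hash to bucket 2, newest first
```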

Thoughts on chaining. Worst-case time for find? Linear, but only with really bad luck or a bad hash function, so it is not worth avoiding (e.g., with balanced trees at each bucket). Beyond asymptotic complexity, some data-structure engineering may be warranted: linked list vs. array vs. tree; move-to-front upon access; maybe leave room for 1 element (or 2?) in the table itself, to optimize constant factors for the common case (a time-space trade-off).

Time vs. space (constant factors only here). [Diagram: the same chained table shown two ways. Left: every slot points to an external chain. Right: the first element of each chain is stored directly in the table slot, with empty slots marked X; this saves an indirection in the common case at the cost of a wider table.]

More rigorous chaining analysis. Definition: the load factor, λ, of a hash table is λ = N / TableSize, where N is the number of elements. Under chaining, the average number of elements per bucket is λ, i.e., the average list has length λ. So if some inserts are followed by random finds, then on average: each unsuccessful find compares against λ items; each successful find compares against λ / 2 items. So we like to keep λ fairly low (e.g., 1 or 1.5 or 2) for chaining.
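As a quick worked example of the arithmetic above (the numbers here are chosen for illustration, not taken from the slides):

```python
# Load-factor arithmetic for chaining: with N = 15 elements in a table
# of size 10, lambda = 1.5, so an average unsuccessful find compares
# against 1.5 items and an average successful find against 0.75.
N, table_size = 15, 10
load_factor = N / table_size      # lambda
avg_unsuccessful = load_factor    # scans the whole chain
avg_successful = load_factor / 2  # on average, half the chain
print(load_factor, avg_unsuccessful, avg_successful)  # 1.5 1.5 0.75
```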

Alternative: no lists; use empty space in the table. Another simple idea: if h(key) is already full, try (h(key) + 1) % TableSize; if full, try (h(key) + 2) % TableSize; if full, try (h(key) + 3) % TableSize; if full ... Example: insert 38, 19, 8, 109, 10 with mod hashing and TableSize = 10. 38 goes to slot 8; 19 goes to slot 9; 8 finds slots 8 and 9 full and lands in slot 0; 109 finds slots 9 and 0 full and lands in slot 1; 10 finds slots 0 and 1 full and lands in slot 2. Final table: slot 0: 8, slot 1: 109, slot 2: 10, slot 8: 38, slot 9: 19.
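The linear-probing insert described above can be sketched as follows (a minimal sketch, with no resizing or deletion):

```python
# Linear probing: if a slot is taken, try the next one, mod TableSize.

def linear_probe_insert(table, key):
    size = len(table)
    for i in range(size):
        slot = (key + i) % size  # h(key) = key % size, probe i steps ahead
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")

# Reproduce the slides' example: insert 38, 19, 8, 109, 10
table = [None] * 10
for k in [38, 19, 8, 109, 10]:
    linear_probe_insert(table, k)
print(table)  # 8, 109, 10 end up in slots 0, 1, 2; 38 in slot 8; 19 in slot 9
```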

Probing hash tables. Trying the next spot is called probing (also called open addressing). We just did linear probing: the i-th probe was (h(key) + i) % TableSize. In general we have some probe function f and use (h(key) + f(i)) % TableSize. Open addressing does poorly with a high load factor λ, so we want larger tables: too many probes means no more O(1).

Other operations. insert finds an open table position using a probe function. What about find? It must use the same probe function to retrace the trail to the data; the search is unsuccessful when it reaches an empty position. What about delete? It must use lazy deletion. Why? A marker indicates "no data here, but don't stop probing". Note: delete with chaining is plain-old list-remove.
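Find and lazy delete for a linear-probing table can be sketched as below; the tombstone marker name is an illustrative choice, not from the slides:

```python
# Lazy deletion: deleted slots get a tombstone marker so that later
# finds keep probing past them instead of stopping early.

TOMBSTONE = object()

def probe_find(table, key):
    size = len(table)
    for i in range(size):
        slot = (key + i) % size                 # linear probing
        if table[slot] is None:                 # truly empty: key cannot be further on
            return None
        if table[slot] is not TOMBSTONE and table[slot] == key:
            return slot
    return None

def probe_delete(table, key):
    slot = probe_find(table, key)
    if slot is not None:
        table[slot] = TOMBSTONE                 # "no data here, but don't stop probing"

# Table state from the linear-probing example, then delete 109
table = [8, 109, 10, None, None, None, None, None, 38, 19]
probe_delete(table, 109)
print(probe_find(table, 10))  # 2: the find probes past the tombstone in slot 1
```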

(Primary) Clustering. It turns out linear probing is a bad idea, even though the probe function is quick to compute (which is a good thing). It tends to produce clusters, which lead to long probing sequences; this is called primary clustering. We saw this starting in our example. [R. Sedgewick]

Analysis of Linear Probing. Trivial fact: for any λ < 1, linear probing will find an empty slot. It is safe in this sense: no infinite loop unless the table is full. Non-trivial facts we won't prove: the average number of probes given λ (in the limit as TableSize → ∞) is: unsuccessful search: (1/2)(1 + 1/(1 - λ)²); successful search: (1/2)(1 + 1/(1 - λ)). This is pretty bad: we need to leave sufficient empty space in the table to get decent performance (see chart).

In a chart. Linear-probing performance degrades rapidly as the table gets full (the formula assumes a large table, but the point remains). [Charts: average number of probes vs. load factor for linear probing, successful ("found") and unsuccessful ("not found") searches; probe counts climb steeply as the load factor approaches 1.] By comparison, chaining performance is linear in λ and has no trouble with λ > 1.

Quadratic probing. We can avoid primary clustering by changing the probe function in (h(key) + f(i)) % TableSize. A common technique is quadratic probing: f(i) = i². So the probe sequence is: 0th probe: h(key) % TableSize; 1st probe: (h(key) + 1) % TableSize; 2nd probe: (h(key) + 4) % TableSize; 3rd probe: (h(key) + 9) % TableSize; ...; i-th probe: (h(key) + i²) % TableSize. Intuition: probes quickly leave the neighborhood.
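The probe sequence above, as a small sketch:

```python
# Quadratic probing: the i-th probe is (h + i*i) % table_size.
def quadratic_probes(h, table_size):
    return [(h + i * i) % table_size for i in range(table_size)]

# For h(key) = 9 and TableSize = 10, the sequence starts 9, 0, 3, 8, ...
print(quadratic_probes(9, 10))
```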

Quadratic Probing Example. TableSize = 10. Insert: 89, 18, 49, 58, 79. 89 goes to slot 9; 18 goes to slot 8; 49 collides at 9, then (9 + 1) % 10 = 0: slot 0; 58 collides at 8 and 9, then (8 + 4) % 10 = 2: slot 2; 79 collides at 9 and 0, then (9 + 4) % 10 = 3: slot 3. Final table: slot 0: 49, slot 2: 58, slot 3: 79, slot 8: 18, slot 9: 89.

Another Quadratic Probing Example. TableSize = 7. Insert: 76 (76 % 7 = 6), 40 (40 % 7 = 5), 48 (48 % 7 = 6), 5 (5 % 7 = 5), 55 (55 % 7 = 6), 47 (47 % 7 = 5). 76 goes to slot 6; 40 goes to slot 5; 48 collides at 6, then (6 + 1) % 7 = 0: slot 0; 5 collides at 5 and 6, then (5 + 4) % 7 = 2: slot 2; 55 collides at 6 and 0, then (6 + 4) % 7 = 3: slot 3; 47 never finds an empty slot. Doh!: for all i, (5 + i²) % 7 is 0, 2, 5, or 6. Nowhere to put the 47!
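The "Doh!" observation is easy to check by enumerating the quadratic probe residues:

```python
# With TableSize = 7 and h(key) = 5, quadratic probing only ever reaches
# slots {0, 2, 5, 6}, because i*i mod 7 takes only the values 0, 1, 2, 4.
reachable = sorted({(5 + i * i) % 7 for i in range(1000)})
print(reachable)  # [0, 2, 5, 6]: slots 1, 3, and 4 are never probed
```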

From Bad News to Good News. Bad news: quadratic probing can cycle through the same full indices, never terminating despite the table not being full. Good news: if TableSize is prime and λ < ½, then quadratic probing will find an empty slot in at most TableSize/2 probes. So: if you keep λ < ½ and TableSize is prime, there is no need to detect cycles.

Clustering reconsidered. Quadratic probing does not suffer from primary clustering: there is no problem with keys initially hashing to the same neighborhood. But it's no help if keys initially hash to the same index; this is called secondary clustering. We can avoid secondary clustering with a probe function that depends on the key: double hashing.

Double hashing. Idea: given two good hash functions h and g, it is very unlikely that for some key, h(key) == g(key). So make the probe function f(i) = i*g(key). Probe sequence: 0th probe: h(key) % TableSize; 1st probe: (h(key) + g(key)) % TableSize; 2nd probe: (h(key) + 2*g(key)) % TableSize; 3rd probe: (h(key) + 3*g(key)) % TableSize; ...; i-th probe: (h(key) + i*g(key)) % TableSize. Detail: make sure g(key) cannot be 0.
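Double hashing can be sketched as below, using the textbook pair of hash functions from the analysis slide (h(key) = key % p, g(key) = q - (key % q), with p and q prime); the specific keys and primes are illustrative:

```python
# Double hashing: each key jumps by its own step size g(key), so keys
# that collide at the same initial index still take different probe paths.

def double_hash_insert(table, key, p, q):
    h = key % p
    g = q - (key % q)          # g(key) is in [1, q], so it is never 0
    for i in range(len(table)):
        slot = (h + i * g) % len(table)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")

table = [None] * 11            # p = 11, q = 7, both prime, q < p
for k in [13, 24, 2]:          # all three hash to index 2 (k % 11 == 2)
    double_hash_insert(table, k, 11, 7)
print(table)  # 13 stays at slot 2; 24 jumps by 4 to slot 6; 2 jumps by 5 to slot 7
```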

Double-hashing analysis. Intuition: because each probe jumps by g(key) each time, we leave the neighborhood and go different places from other initial collisions. But we could still have a problem like in quadratic probing, where we are not safe (an infinite loop despite room in the table). It is known that this cannot happen in at least one case: h(key) = key % p; g(key) = q - (key % q); with 2 < q < p, and p and q prime.

Rehashing. As with array-based stacks/queues/lists, if the table gets too full, create a bigger table and copy everything over. With chaining, we get to decide what "too full" means: keep the load factor reasonable (e.g., < 1)? Consider the average or max size of non-empty chains? For probing, half-full is a good rule of thumb. New table size: twice as big is a good idea, except that won't be prime! So go to about twice as big. You can keep a list of prime numbers in your code, since you won't grow more than 20-30 times.
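A rehashing sketch following the guidance above; the prime list and the keys are illustrative choices, not values from the slides:

```python
# Grow to roughly twice the size, rounded up to a prime from a small
# precomputed list, then re-insert every key: each key's slot depends
# on the table size, so keys must be rehashed, not just copied.

PRIMES = [11, 23, 47, 97, 197, 397, 797, 1597]

def next_prime_size(current_size):
    for p in PRIMES:
        if p > 2 * current_size:
            return p
    raise RuntimeError("prime list exhausted")

def rehash(old_table):
    new_table = [None] * next_prime_size(len(old_table))
    for key in old_table:
        if key is not None:
            slot = key % len(new_table)
            while new_table[slot] is not None:   # linear probing
                slot = (slot + 1) % len(new_table)
            new_table[slot] = key
    return new_table

old = [None] * 11
old[5], old[6], old[7] = 5, 16, 27   # 16 and 27 had probed past slot 5 (all hash to 5 mod 11)
new = rehash(old)
print(len(new))  # 23: in the bigger table the keys land at slots 5, 16, and 4
```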

Summary. Hashing gives us approximately O(1) behavior for both insert and find; collisions are what ruin it. There are several different collision strategies. Chaining just uses linked lists pointed to by the hash table bins. Probing uses various methods for computing the next index to try if the first one is full. Rehashing makes a new, bigger table. If the table is kept reasonably empty (small load factor) and the hash function works well, we will get the kind of behavior we want.