Review of Hashing: Integer Keys



Similar documents
CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions. Linda Shapiro Spring 2016

Hash Tables. Computer Science E-119 Harvard Extension School Fall 2012 David G. Sullivan, Ph.D. Data Dictionary Revisited

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

B+ Tree Properties B+ Tree Searching B+ Tree Insertion B+ Tree Deletion Static Hashing Extendable Hashing Questions in pass papers

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

From Last Time: Remove (Delete) Operation

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

CSE 326: Data Structures B-Trees and B+ Trees

Fundamental Algorithms

Record Storage and Primary File Organization

Lecture 1: Data Storage & Index

Tables so far. set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n)

Physical Data Organization

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium.

Data Structures in Java. Session 15 Instructor: Bert Huang

Converting a Number from Decimal to Binary

DATABASE DESIGN - 1DL400

Outline BST Operations Worst case Average case Balancing AVL Red-black B-trees. Binary Search Trees. Lecturer: Georgy Gimel farb

Class Notes CS Creating and Using a Huffman Code. Ref: Weiss, page 433

Sample Questions Csci 1112 A. Bellaachia

Practical Survey on Hash Tables. Aurelian Țuțuianu

Part 5: More Data Structures for Relations

Symbol Tables. Introduction

Binary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Lecture Notes on Binary Search Trees

- Easy to insert & delete in O(1) time - Don t need to estimate total memory needed. - Hard to search in less than O(n) time

Project 2: Bejeweled

DATA STRUCTURES USING C

1. Define: (a) Variable, (b) Constant, (c) Type, (d) Enumerated Type, (e) Identifier.

Binary Search Trees. A Generic Tree. Binary Trees. Nodes in a binary search tree ( B-S-T) are of the form. P parent. Key. Satellite data L R

recursion, O(n), linked lists 6/14

A Comparison of Dictionary Implementations

Project Group High- performance Flexible File System 2010 / 2011

CHAPTER 13: DISK STORAGE, BASIC FILE STRUCTURES, AND HASHING

Chapter 8: Structures for Files. Truong Quynh Chi Spring- 2013

Data storage Tree indexes

DNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT

CS104: Data Structures and Object-Oriented Design (Fall 2013) October 24, 2013: Priority Queues Scribes: CS 104 Teaching Team

A binary search tree or BST is a binary tree that is either empty or in which the data element of each node has a key, and:

Playing with Numbers

An overview of FAT12

Chapter 13. Disk Storage, Basic File Structures, and Hashing

CompSci-61B, Data Structures Final Exam

Ordered Lists and Binary Trees

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

University of Massachusetts Amherst Department of Computer Science Prof. Yanlei Diao

Solutions to Problem Set 1

Motivation Suppose we have a database of people We want to gure out who is related to whom Initially, we only have a list of people, and information a

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

Heaps & Priority Queues in the C++ STL 2-3 Trees

Sequential Data Structures

1. The memory address of the first element of an array is called A. floor address B. foundation addressc. first address D.

Chapter 13: Query Processing. Basic Steps in Query Processing

Data Structures and Algorithms

Math 319 Problem Set #3 Solution 21 February 2002

Binary Search Trees CMPSC 122

Data Warehousing und Data Mining

ENHANCEMENTS TO SQL SERVER COLUMN STORES. Anuhya Mallempati #

Binary Heap Algorithms

B-Trees. Algorithms and data structures for external memory as opposed to the main memory B-Trees. B -trees

Lecture Notes on Binary Search Trees

Introduction to Data Structures

COMP 250 Fall 2012 lecture 2 binary representations Sept. 11, 2012

CIS 631 Database Management Systems Sample Final Exam

Chapter 8: Bags and Sets

BM307 File Organization

Naming vs. Locating Entities

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Full and Complete Binary Trees

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING LESSON PLAN

The ADT Binary Search Tree

St S a t ck a ck nd Qu Q eue 1

The Taxman Game. Robert K. Moniot September 5, 2003

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

File System Management

Lecture 6: Binary Search Trees CSCI Algorithms I. Andrew Rosenberg

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

Performance Tuning for the Teradata Database

Analysis of a Search Algorithm

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

Efficient Data Structures for Decision Diagrams

Q&As: Microsoft Excel 2013: Chapter 2

Part III Storage Management. Chapter 11: File System Implementation

Benford s Law: Tables of Logarithms, Tax Cheats, and The Leading Digit Phenomenon

Database Systems. Session 8 Main Theme. Physical Database Design, Query Execution Concepts and Database Programming Techniques

Structure for String Keys

C++ INTERVIEW QUESTIONS

Lecture 12 Doubly Linked Lists (with Recursion)

Merkle Hash Trees for Distributed Audit Logs

10CS35: Data Structures Using C

Big O and Limits Abstract Data Types Data Structure Grand Tour.

The Tower of Hanoi. Recursion Solution. Recursive Function. Time Complexity. Recursive Thinking. Why Recursion? n! = n* (n-1)!

Questions 1 through 25 are worth 2 points each. Choose one best answer for each.

Transcription:

CSE 326 Lecture 13: Much ado about Hashing Today s munchies to munch on: Review of Hashing Collision Resolution by: Separate Chaining Open Addressing $ Linear/Quadratic Probing $ Double Hashing Rehashing Extendible Hashing Covered in Chapter 5 in the text 1 Review of Hashing: Integer Keys Idea: Store data record with its key in array slot: A[Hash(key)] where Hash is a hashing function. Integer Keys: Hash(key) = key mod TableSize TableSize is size of the array (preferably a prime number) 2

Review of Hashing: String Keys Idea: Store data record with its key in array slot: A[Hash(key)] where Hash is a hashing function. String Keys: Treat characters as digits (e.g. use ASCII value) Hash(key) = StringInt(key) mod TableSize Examples: StringInt( abc ) = 1*27 2 + 2*27 1 + 3 = 786 StringInt( bca ) = 2*27 2 + 3*27 1 + 1 = 1540 StringInt( cab ) = 3*27 2 + 1*27 1 + 2 = 2216 3 Collisions What if two different keys hash to the same value? E.g. TableSize = 17. Keys 18 and 35 hash to same value: 18 mod 17 = 1 and 35 mod 17 = 1 Cannot store both data records in the same slot in array! This is called a collision How can we resolve collisions during hashing? 4

Collision Resolution Two different methods: Separate Chaining: Use data structure (such as a linked list) to store multiple items that hash to the same slot Open addressing (or probing): search for other slots using a second function and store item in first empty slot that is found 5 Collision Resolution Two different methods: Separate Chaining: Use data structure (such as a linked list) to store multiple items that hash to the same slot Open addressing (or probing): search for other slots using a second function and store item in first empty slot that is found Chaining and probing??? Get me outta here!! 6

Separate Chaining Each hash table cell holds a pointer to a linked list of records with same hash value (i, j, k in figure) Collision: Insert item into linked list To Find an item: compute hash value, then do Find on linked list Can use List ADT for Find/Insert/Delete in linked list Can also use BSTs: O(log N) time instead of O(N). But lists are usually small not worth the overhead of BSTs 7 Load Factor of a Hash Table Let N = number of items to be stored Load factor λ = N/TableSize Suppose TableSize = 2 and number of items N = 10 λ = 5 Suppose TableSize = 10 and number of items N = 2 λ = 0.2 Average length of chained list = λ Average time for accessing an item = O(1) + O(λ) Want λ to be close to 1 (i.e. TableSize N) But chaining continues to work for λ > 1 8

Collision Resolution by Open Addressing Linked lists can take up a lot of space Open addressing (or probing): When collision occurs, try alternative cells in the array until an empty cell is found Given an item X, try cells h 0 (X), h 1 (X), h 2 (X),, h i (X) h i (X) = (Hash(X) + F(i)) mod TableSize Define F(0) = 0 F is the collision resolution function. Three possibilities: Linear: F(i) = i Quadratic: F(i) = i 2 Double Hashing: F(i) = i Hash 2 (X) 9 Open Addressing I: Linear Probing Main Idea: When collision occurs, scan down the array one cell at a time looking for an empty cell h i (X) = (Hash(X) + i) mod TableSize (i = 0, 1, 2, ) Compute hash value and increment until free cell is found In-Class Example: Insert {18, 19, 20, 29, 30, 31} into empty hash table with TableSize = 10 using: (a) separate chaining (b) linear probing 10

Load Factor Analysis of Linear Probing Recall: Load factor λ = N/TableSize Fraction of empty cells = 1 - λ Number of such cells we expect to probe = 1/(1- λ) Can show that expected number of probes for: Successful searches = O(1+1/(1- λ)) Insertions and unsuccessful searches = O(1+1/(1- λ) 2 ) Keep λ 0.5 to keep number of probes small (between 1 and 5). (E.g. What happens when λ = 0.99) 11 Drawbacks of Linear Probing Works until array is full, but as number of items N approaches TableSize (λ 1), access time approaches O(N) Very prone to cluster formation (as in our example) If key hashes into a cluster, finding free cell involves going through the entire cluster Inserting this key at the end of cluster causes the cluster to grow: future Inserts will be even more time consuming! This type of clustering is called Primary Clustering Can have cases where table is empty except for a few clusters Does not satisfy good hash function criterion of distributing keys uniformly 12

Open Addressing II: Quadratic Probing Main Idea: Spread out the search for an empty slot Increment by i 2 instead of i h i (X) = (Hash(X) + i 2 ) mod TableSize (i = 0, 1, 2, ) No primary clustering but secondary clustering possible Example 1: Insert {18, 19, 20, 29, 30, 31} into empty hash table with TableSize = 10 Example 2: Insert {1, 2, 5, 10, 17} with TableSize = 16 Note: 25 mod 16 = 9, 36 mod 16 = 4, 49 mod 16 = 1, etc. Theorem: If TableSize is prime and λ < 0.5, quadratic probing will always find an empty slot 13 Open Addressing III: Double Hashing Idea: Spread out the search for an empty slot by using a second hash function No primary or secondary clustering h i (X) = (Hash(X) + i Hash 2 (X)) mod TableSize for i = 0, 1, 2, E.g. Hash 2 (X) = R (X mod R) R is a prime smaller than TableSize Try this example: Insert {18, 19, 20, 29, 30, 31} into empty hash table with TableSize = 10 and R = 7 No clustering but slower than quadratic probing due to Hash 2 14

The need to be lazy Data Struct - ures Need to use lazy deletion if we use probing (why?) Think about how Find(X) would work Mark array slots as Active/Not Active If table gets too full (λ 1) or if many deletions have occurred: Running time for Find etc. gets too long, and Inserts may fail! What do we do? 15 Rehashing Rehashing Allocate a larger hash table (of size 2*TableSize) whenever λ exceeds a particular value How does it work? Cannot just copy data from old table: Bigger table has a new hash function Go through old hash table, ignoring items marked deleted Recompute hash value for each non-deleted key and put the item in new position in new table Running time = O(N) but happens very infrequently 16

Extendible Hashing What if we have large amounts of data that can only be stored on disks and we want to find data in 1-2 disk accesses Could use B-trees but deciding which of many branches to go to takes time Extendible Hashing: Store item according to its bit pattern Hash(X) = first d L bits of X Each leaf contains M data items with d L identical leading bits Root contains pointers to sorted data items in the leaves 17 Extendible Hashing: The details Extendible Hashing: Store data Hash(X) = first 2 bits of X according to bit patterns Directory Root is known as the directory M is the size of a disk block 00 01 10 11 0000 0010 0011 0101 0110 1000 1001 1010 Disk Blocks (M = 3) 1110 18

Extendible Hashing: More details Extendible Hashing: Insert: If leaf is full, split leaf and increase directory bits by one (e.g. 000, 001, 010, etc.) To avoid collisions and too much splitting, would like bits to be nearly random Hash keys to long integers and then look at leading bits Hash(X) = first 2 bits of X Directory 00 01 10 11 0000 0010 0011 0101 0110 1000 1001 1010 1110 Disk Blocks (M = 3) 19 Applications of Hashing In Compilers: Used to keep track declared variables in source code this hash table is known as the Symbol Table. In storing information associated with strings Example: Counting word frequencies in a text! (as in HW 3) In Game playing programs: Store the move for each position by hashing that position into a hash table Called the Transposition table In on-line spell checkers like this ome Entire dictionary stored in a hash table Each word in text hashed if not found, word is misspelled. 20

Next Class: All sorts of Sorts To Do: Finish reading Chapter 5 Assignment # 3 (due Thu Feb 13) Midterm on Wed Feb 12! 21