Fundamental Algorithms



Similar documents
Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Chapter Objectives. Chapter 9. Sequential Search. Search Algorithms. Search Algorithms. Binary Search

Tables so far. set() get() delete() BST Average O(lg n) O(lg n) O(lg n) Worst O(n) O(n) O(n) RB Tree Average O(lg n) O(lg n) O(lg n)

Record Storage and Primary File Organization

Review of Hashing: Integer Keys

BM307 File Organization

Hash Tables. Computer Science E-119 Harvard Extension School Fall 2012 David G. Sullivan, Ph.D. Data Dictionary Revisited

CSE373: Data Structures & Algorithms Lecture 14: Hash Collisions. Linda Shapiro Spring 2016

Chapter 13. Disk Storage, Basic File Structures, and Hashing

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 13-1

Chapter 13 Disk Storage, Basic File Structures, and Hashing.

A COOL AND PRACTICAL ALTERNATIVE TO TRADITIONAL HASH TABLES

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

INTRODUCTION The collection of data that makes up a computerized database must be stored physically on some computer storage medium.

Chapter 13. Chapter Outline. Disk Storage, Basic File Structures, and Hashing

Notes on Factoring. MA 206 Kurt Bryan

Krishna Institute of Engineering & Technology, Ghaziabad Department of Computer Application MCA-213 : DATA STRUCTURES USING C

Physical Data Organization

Practical Survey on Hash Tables. Aurelian Țuțuianu

Factoring & Primality

Lecture 1: Data Storage & Index

DATABASE DESIGN - 1DL400

Lecture 2 February 12, 2003

A Survey on Efficient Hashing Techniques in Software Configuration Management

Binary Search Trees. Data in each node. Larger than the data in its left child Smaller than the data in its right child

Breaking The Code. Ryan Lowe. Ryan Lowe is currently a Ball State senior with a double major in Computer Science and Mathematics and

APP INVENTOR. Test Review

CS 464/564 Introduction to Database Management System Instructor: Abdullah Mueen

Database Systems. Session 8 Main Theme. Physical Database Design, Query Execution Concepts and Database Programming Techniques

A Comparison of Dictionary Implementations

Project Group High- performance Flexible File System 2010 / 2011

Storage Management for Files of Dynamic Records

CHAPTER 5. Number Theory. 1. Integers and Division. Discussion

New Hash Function Construction for Textual and Geometric Data Retrieval

Analysis of a Search Algorithm

Operating Systems CSE 410, Spring File Management. Stephen Wagner Michigan State University

Big Data & Scripting Part II Streaming Algorithms

Cuckoo Filter: Practically Better Than Bloom

Zabin Visram Room CS115 CS126 Searching. Binary Search

Determining the Optimal Combination of Trial Division and Fermat s Factorization Method

Storage and File Structure

Original-page small file oriented EXT3 file storage system

Data Structures and Algorithm Analysis (CSC317) Intro/Review of Data Structures Focus on dynamic sets

EFFICIENT EXTERNAL SORTING ON FLASH MEMORY EMBEDDED DEVICES

DATA STRUCTURES USING C

DNS LOOKUP SYSTEM DATA STRUCTURES AND ALGORITHMS PROJECT REPORT

The finite field with 2 elements The simplest finite field is

CHAPTER 13: DISK STORAGE, BASIC FILE STRUCTURES, AND HASHING

6. Storage and File Structures

Study of algorithms for factoring integers and computing discrete logarithms

Performance Tuning for the Teradata Database

7.1 Our Current Model

File System Management

SOLUTIONS FOR PROBLEM SET 2

Chapter 8: Structures for Files. Truong Quynh Chi Spring- 2013

QUEUES. Primitive Queue operations. enqueue (q, x): inserts item x at the rear of the queue q

Data Structures in Java. Session 15 Instructor: Bert Huang

Factoring Algorithms

FAREY FRACTION BASED VECTOR PROCESSING FOR SECURE DATA TRANSMISSION

Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems

Primality - Factorization

= = 3 4, Now assume that P (k) is true for some fixed k 2. This means that

U.C. Berkeley CS276: Cryptography Handout 0.1 Luca Trevisan January, Notes on Algebra

The Mixed Binary Euclid Algorithm

Elementary Number Theory and Methods of Proof. CSE 215, Foundations of Computer Science Stony Brook University

Analysis of Binary Search algorithm and Selection Sort algorithm

Positional Numbering System

Lecture 3: Finding integer solutions to systems of linear equations

2) What is the structure of an organization? Explain how IT support at different organizational levels.

Sorting revisited. Build the binary search tree: O(n^2) Traverse the binary tree: O(n) Total: O(n^2) + O(n) = O(n^2)

Factoring Algorithms

Chapter 11: File System Implementation. Chapter 11: File System Implementation. Objectives. File-System Structure

GCE Computing. COMP3 Problem Solving, Programming, Operating Systems, Databases and Networking Report on the Examination.

CONTINUED FRACTIONS AND FACTORING. Niels Lauritzen

Scalable Prefix Matching for Internet Packet Forwarding

SHARED HASH TABLES IN PARALLEL MODEL CHECKING

Binary Heap Algorithms

Magento & Zend Benchmarks Version 1.2, 1.3 (with & without Flat Catalogs)

2) Based on the information in the table which choice BEST shows the answer to 1 906?

Sheet 7 (Chapter 10)

Florida Math Correlation of the ALEKS course Florida Math 0018 to the Florida Mathematics Competencies - Lower

Part 5: More Data Structures for Relations

1. Define: (a) Variable, (b) Constant, (c) Type, (d) Enumerated Type, (e) Identifier.

Abstract Data Type. EECS 281: Data Structures and Algorithms. The Foundation: Data Structures and Abstract Data Types

Introduction to Data Structures and Algorithms

Unit Storage Structures 1. Storage Structures. Unit 4.3

Table of Contents. Bibliografische Informationen digitalisiert durch

Integer Factorization using the Quadratic Sieve

PRIME FACTORS OF CONSECUTIVE INTEGERS

1) The postfix expression for the infix expression A+B*(C+D)/F+D*E is ABCD+*F/DE*++

A binary heap is a complete binary tree, where each node has a higher priority than its children. This is called heap-order property

Cryptography and Network Security Department of Computer Science and Engineering Indian Institute of Technology Kharagpur

Math Review. for the Quantitative Reasoning Measure of the GRE revised General Test

6. Standard Algorithms

Copy in your notebook: Add an example of each term with the symbols used in algebra 2 if there are any.

Lukasz Pater CMMS Administrator and Developer

Transcription:

Fundamental Algorithms Chapter 7: Hash Tables Michael Bader Winter 2014/15 Chapter 7: Hash Tables, Winter 2014/15 1

Generalised Search Problem Definition (Search Problem) Input: a sequence or set A of n elements A, and an x A. Output: Index i {1,..., n with x = A[i], or NIL, if x A. complexity depends on data structure complexity of operations to set up data structure? (insert/delete) Definition (Generalised Search Problem) Store a set of objects consisting of a key and additional data: Object := ( key : Integer,. record : Data ) ; search/insert/delete objects in this set Chapter 7: Hash Tables, Winter 2014/15 2

Direct-Address Tables Definition (table as data structure) similar to array: access element via index usually contains elements only for some of the indices Direct-Address Table: assume: limited number of values for the keys: U = {0, 1,..., m 1 allocate table of size m use keys directly as index Chapter 7: Hash Tables, Winter 2014/15 3

Direct-Address Tables (2) D i r A d d r I n s e r t ( T : Table, x : Object ) { T [ x. key ] := x ; DirAddrDelete ( T : Table, x : Object ){ T [ x. key ] := NIL ; DirAddrSearch ( T : Table, key : Integer ){ return T [ key ] ; Chapter 7: Hash Tables, Winter 2014/15 4

Direct-Address Tables (3) Advantage: very fast: search/delete/insert is Θ(1) Disadvantages: m has to be small, or otherwise, the table has to be very large! if only few elements are stored, lots of table elements are unused (waste of memory) all keys need to be distinct (they should be, anyway) Chapter 7: Hash Tables, Winter 2014/15 5

Hash Tables Idea: compute index from key Wanted: function h that maps a given key to an index, has a relatively small range of values, and can be computed efficiently, Definition (hash function, hash table) Such a function h is called a hash function. The respective table is called a hash table. Chapter 7: Hash Tables, Winter 2014/15 6

Hash Tables Insert, Delete, Search HashInsert ( T : Table, x : Object ) { T [ h ( x. key ) ] := x ; HashDelete ( T : Table, x : Object ) { T [ h ( x. key ) ] : = NIL ; HashSearch ( T : Table, x : Object ) { return T [ h ( x. key ) ] ; Chapter 7: Hash Tables, Winter 2014/15 7

So Far: Naive Hashing Advantages: still very fast: search/delete/insert is Θ(1), if h is Θ(1) size of the table can be chosen freely, provided there is an appropriate hash function h Disadvantages: values of h have to be distinct for all keys however: impossible to find a hash function that produces distinct values for any set of stored data ToDo: deal with collisions: objects with different keys that share a common hash value have to be stored in the same table element Chapter 7: Hash Tables, Winter 2014/15 8

Resolve Collisions by Chaining Idea: use a table of containers containers can hold an arbitrarily large amount of data using (linked) lists as containers: chaining ChainHashInsert ( T : Table, x : Object ) { i n s e r t x i n t o T [ h ( x. key ) ] ; ChainHashDelete ( T : Table, x : Object ) { d e lete x from T [ h ( x. key ) ] ; Chapter 7: Hash Tables, Winter 2014/15 9

Resolve Collisions by Chaining ChainHashSearch ( T : Table, x : Object ) { return ListSearch ( x, T [ h ( x. key ) ] ) ;! r e s u l t : reference to x or NIL, i f x not found ; Advantages: hash function no longer has to return distinct values still very fast, if the lists are short Disadvantages: delete/search is Θ(k), if k elements are in the accessed list worst case: all elements stored in one single list (very unlikely). Chapter 7: Hash Tables, Winter 2014/15 10

Chaining Average Search Complexity Assumptions: hash table has m slots (table of m lists) contains n elements load factor: α = n m h(k) can be computed in O(1) for all k all values of h are equally likely to occur Search complexity: on average, the list corresponding to the requested key will have α elements unsuccessful search: compare the requested key with all objects in the list, i.e. O(α) operations successful search: requested key last in the list; also O(α) operations Expected: Average complexity: O(1 + α) operations Chapter 7: Hash Tables, Winter 2014/15 11

Hash Functions A good hash function should: satisfy the assumption of even distribution: each key is equally likely to be hashed to any of the slots: k : h(k)=j be easy to compute (P(key = k)) = 1 m for all j = 0,..., m 1 be non-smooth : keys that are close together should not produce hash values that are close together (to avoid clustering) Simplest choice: h = k mod m (m a prime number) easy to compute; even distribution if keys evenly distributed however: not non-smooth Chapter 7: Hash Tables, Winter 2014/15 12

The Multiplication Method for Integer Keys Two-step method 1. multiply k by constant 0 < γ < 1, and extract fractional part of kγ 2. multiply by m, and use integer part as hash value: h(k) := m(γk mod 1) = m(γk γk ) Remarks: value of m uncritical; e.g. m = 2 p value of γ needs to be chosen well in practice: use fix-point arithmetics non-integer keys: use encoding to integers (ASCII, byte encoding,... ) Chapter 7: Hash Tables, Winter 2014/15 13

Open Addressing Definition no containers: table contains objects each slot of the hash table either contains an object or NIL to resolve collisions, more than one position is allowed for a specific key Hash function: generates sequence of hash table indices: General approach: h : U {0,..., m 1 {0,..., m 1 store object in the first empty slot specified by the probe sequence empty slot in the hash table guaranteed, if the probe sequence h(k, 0), h(k, 1),..., h(k, m 1) is a permutation of 0, 1,..., m 1 Chapter 7: Hash Tables, Winter 2014/15 14

Open Addressing Algorithms OpenHashInsert ( T : Table, x : Object ) : Integer { f o r i from 0 to m 1 do { j := h ( x. key, i ) ; i f T [ j ]= NIL then { T [ j ] := x ; return j ; cast e r r o r hash t a b l e overflow OpenHashSearch ( T : Table, k : Integer ) : Object { i := 0; while T [ h ( k, i ) ] <> NIL and i < m { i f k = T [ h ( k, i ) ]. key then return T [ h ( k, i ) ] ; i := i +1; return NIL ; Chapter 7: Hash Tables, Winter 2014/15 15

Open Addressing Linear Probing Hash function: h(k, i) := (h 0 (k) + i) mod m first slot to be checked is T[h 0 (k)] second probe slot is T[h 0 (k) + 1], then T[h 0 (k) + 2], etc. wrap around to T[0] after T[m 1] has been checked Main problem: clustering continuous sequences of occupied slots ( clusters ) cause lots of checks during searching and inserting clusters tend to grow, because all objects that are hashed to a slot inside the cluster will increase it slight (but minor) improvement: h(k, i) := (h 0 (k) + ci) mod m Main advantage: simple and fast easy to implement cache efficient! Chapter 7: Hash Tables, Winter 2014/15 16

Open Addressing Quadratic Probing Hash function: h(k, i) := (h 0 (k) + c 1 i + c 2 i 2 ) mod m how to chose constants c 1 and c 2? objects with identical h 0 (k) still have the same sequence of hash values ( secondary clustering ) Idea: double hashing h(k, i) := (h 0 (k) + i h 1 (k)) mod m if h 0 is identical for two keys, h 1 will generate different probe sequences Chapter 7: Hash Tables, Winter 2014/15 17

Open Addressing Double Hashing h(k, i) := (h 0 (k) + i h 1 (k)) mod m How to choose h 0 and h 1 : range of h 0 : U {0,..., m 1 (cover entire table) h 1 (k) must never be 0 (no probe sequence generated) h 1 (k) should be prime to m for all k probe sequence will try all slots if d is the greatest common divisor of h 1 (k) and m, only 1 d hash slots will be probed of the Possible choices: m = 2 M and let h 1 generate odd numbers, only m a prime number, and h 1 : U {1,..., m 1 with m 1 < m Chapter 7: Hash Tables, Winter 2014/15 18

Collisions and Clustering Scenarios for Collisions: keys share the same primary hash value: h(k 1, 0) = h(k 2, 0) same sequence of hash values for linear and quadratic probing keys share a value of the hash sequence: h(k 1, i) = h(k 2, j) same sequence of hash values for linear probing different hash values for next try: h(k 1, i + 1) h(k 2, j + 1) Example: multiple keys that share the same has values linear hashing will cause primary cluster cluster will also grow by all keys mapped to a hash value within this cluster Chapter 7: Hash Tables, Winter 2014/15 19

Open Addressing Deletion Problem remaining: how to delete? search entry, remove it does not work: insert 3, 7, 8 having same hash-value, then delete 7 how to find 8? do not delete, just mark as deleted Next problem: searching stops if first empty entry found after many deletions: lots of unnecessary comparisons! Chapter 7: Hash Tables, Winter 2014/15 20

Open Addressing Deletion (2) Deletion general problem for open hashing only solution : new construction of table after some deletions hash tables therefore commonly don t support deletion Inserting inserting efficient, but too many inserts not enough space if ratio α too big, new construction of table with larger size Still... searching faster than O(log n) possible Chapter 7: Hash Tables, Winter 2014/15 21