Chapter 9: Search Algorithms (Sequential Search, Binary Search, Hashing)




Chapter Objectives (Data Structures Using C++, slides 1-2)

- Learn the various search algorithms
- Explore how to implement the sequential and binary search algorithms
- Discover how the sequential and binary search algorithms perform
- Become aware of the lower bound on comparison-based search algorithms
- Learn about hashing

Sequential Search (slide 3)

template <class elemType>
int arrayListType<elemType>::seqSearch(const elemType& item)
{
    int loc;
    bool found = false;

    for (loc = 0; loc < length; loc++)
        if (list[loc] == item)
        {
            found = true;
            break;
        }

    if (found)
        return loc;     // What is the time complexity?
    else
        return -1;
} // end seqSearch

Search Algorithms (slides 4-6)

Search item: the target.

To determine the average number of comparisons in the successful case of the sequential search algorithm:
- Consider all possible cases
- Find the number of comparisons for each case
- Add the number of comparisons and divide by the number of cases

Suppose that there are n elements in the list. Assuming that each element is equally likely to be sought, the average number of comparisons is

    (1 + 2 + ... + n) / n

It is known that 1 + 2 + ... + n = n(n + 1)/2. Therefore, the average number of comparisons made by the sequential search in the successful case is

    (n + 1) / 2

Binary Search (assumes the list is sorted)

Binary Search: middle element (slide 7)

    mid = (first + last) / 2

Binary Search (slide 8)

template <class elemType>
int orderedArrayListType<elemType>::binarySearch(const elemType& item)
{
    int first = 0;
    int last = length - 1;
    int mid;
    bool found = false;

    while (first <= last && !found)
    {
        mid = (first + last) / 2;
        if (list[mid] == item)
            found = true;
        else if (list[mid] > item)
            last = mid - 1;
        else
            first = mid + 1;
    }

    if (found)
        return mid;
    else
        return -1;
} // end binarySearch

Binary Search: Example (slides 9-10)

Unsuccessful search: the total number of comparisons is 6.

Performance of Binary Search (slides 11-12)

Performance of Binary Search (slide 13)

For a list of length n:
- Unsuccessful search: a binary search makes approximately 2 log2(n + 1) key comparisons.
- Successful search: on average, a binary search makes about 2 log2(n) - 4 key comparisons; the worst-case upper bound is 2 + 2 log2(n).

Search Algorithm Analysis Summary / Lower Bound on Comparison-Based Search (slides 14-15)

Definition: A comparison-based search algorithm performs its search by repeatedly comparing the target element to the list elements.

Theorem: Let L be a sorted list of size n > 1. If SRH(n) denotes the minimum number of comparisons needed, in the worst case, by a comparison-based algorithm to determine whether an element x is in L, then SRH(n) = ceil(log2(n + 1)).

If the list is not sorted, the worst case is n comparisons.

Corollary: Binary search is the optimal worst-case comparison-based algorithm for searching a sorted list. For unsorted lists, sequential search is optimal.

Hashing (slide 16)

- An alternative to comparison-based search.
- Requires storing data in a special data structure, called a hash table.
- Main objectives when choosing a hash function:
  - Choose a hash function that is easy to compute.
  - Minimize the number of collisions.

Commonly Used Hash Functions (slides 17-18)

Mid-Square
- The hash function h is computed by squaring the identifier, then using an appropriate number of bits from the middle of the square as the bucket address.
- Because the middle bits of a square usually depend on all the characters of the key, different keys are expected to yield different hash addresses with high probability, even if some of the characters are the same.

Folding
- The key X is partitioned into parts such that all the parts, except possibly the last, are of equal length.
- The parts are then added, in some convenient way, to obtain the hash address.

Division (modular arithmetic)
- The key X is converted into an integer iX.
- This integer is divided by the size of the hash table; the remainder gives the address of X in HT.

Commonly Used Hash Functions (slide 19)

Suppose that each key is a string. The following C++ function uses the division method to compute the address of the key:

int hashFunction(char *key, int keyLength)
{
    int sum = 0;
    for (int j = 0; j < keyLength; j++)   // note: j < keyLength, not <=
        sum = sum + static_cast<int>(key[j]);
    return sum % HTSize;
} // end hashFunction

Collision Resolution (slide 20)

- Algorithms to handle collisions.
- Two categories of collision resolution techniques:
  - Open addressing (closed hashing)
  - Chaining (open hashing)

Deletion: Open Addressing Deletion: Open Addressing When deleting, need to remove the item from its spot, but cannot reset it to empty (Why?) IndexStatusList[i] set to 1 to mark item i as deleted Data Structures Using C++ 25 Data Structures Using C++ 26 Collision Resolution: Chaining (Open Hashing) Hashing Analysis Let Then a is called the load factor No probing needed; instead put linked list at each hash position Data Structures Using C++ 27 Data Structures Using C++ 28 Average Number of Comparisons Linear probing: Successful search: Unsuccessful search: Chaining: Average Number of Comparisons 1. Successful search Quadratic probing: Successful search: 2. Unsuccessful search Unsuccessful search: Data Structures Using C++ 29 Data Structures Using C++ 30

Hashing 1: Searching Lists

There are many instances when one is interested in storing and searching a list:
- A phone company wants to provide caller ID: given a phone number, a name is returned.
- Somebody who plays the Lotto thinks they can increase their chance to win if they keep track of the winning numbers from the last 3 years.

Both of these have simple solutions using arrays, linked lists, binary trees, etc. Unfortunately, none of these solutions is adequate, as we will see.

Hashing 2: Problematic List Searching

Here are some solutions to the problems. For caller ID:
- Use an array indexed by phone number. Requires one operation to return a name, but an array of size 1,000,000,000 is needed.
- Use a linked list. This requires O(n) time and space, where n is the number of actual phone numbers. This is the best we can do space-wise, but the time required is horrendous.
- A balanced binary tree: O(n) space and O(log n) time. Better, but not good enough.

For the Lotto we could use the same solutions. The number of possible lotto numbers is 15,625,000 for a 6/50 lotto. The number of entries stored is on the order of 1000, assuming a daily lotto.

Hashing 3: The Hash Table Solution

A hash table is similar to an array:
- It has fixed size m.
- An item with key k goes into index h(k), where h is a function from the keyspace to {0, 1, ..., m-1}.
- The function h is called a hash function. We call h(k) the hash value of k.

A hash table allows us to store values from a large set in a small array in such a way that searching is fast. The space required to store n numbers in a hash table of size m is O(m + n). Notice that this does not depend on the size of the keyspace. The average time required to insert, delete, and search for an element in a hash table is O(1). Sounds perfect. Then what's the problem? Let's look at an example.

Hashing 4: Hash Table Example

I have 5 friends whose phone numbers I want to store in a hash table of size 8. Their names and numbers are:

    Susy Olson       555-1212
    Sarah Skillet    555-4321
    Ryan Hamster     545-3241
    Justin Case      555-6798
    Chris Lindmeyer  535-7869

I use as a hash function h(k) = k mod 8, where k is the phone number viewed as a 7-digit decimal number. Notice that 5554321 mod 8 = 5453241 mod 8 = 1. But I can't put Sarah Skillet and Ryan Hamster both in position 1. Can we fix this problem?

Hashing 5: Hash Table Problems

The problem with hash tables is that two keys can have the same hash value. This is called a collision. There are several ways to deal with this problem:
- Pick the hash function h to minimize the number of collisions.
- Implement the hash table in a way that allows keys with the same hash value to all be stored.

The second method is almost always needed, even for very good hash functions. Why? We will talk about 2 collision resolution techniques: chaining and open addressing. But first, a bit more on hash functions.

Hashing 6: Hash Functions

Most hash functions assume the keys come from the set {0, 1, ...}. If the keys are not natural numbers, some method must be used to convert them:
- Phone numbers and ID numbers can be converted by removing the hyphens.
- Characters can be converted using ASCII.
- Strings can be converted by converting each character using ASCII, and then interpreting the resulting sequence of natural numbers as if it were stored base 128.

We need to choose a good hash function. A hash function is good if it can be computed quickly, and the keys are distributed uniformly throughout the table.

Hashing 7: Some Good Hash Functions

The division method: h(k) = k mod m.
- The modulus m must be chosen carefully. Powers of 2 and 10 can be bad. Why?
- Prime numbers not too close to powers of 2 are a good choice. We can pick m by choosing an appropriate prime number that is close to the table size we want.

The multiplication method: let A be a constant with 0 < A < 1. Then we use h(k) = floor(m * (kA mod 1)), where kA mod 1 means the fractional part of kA.
- The choice of m is not as critical here. We can choose m to make the implementation easy and/or fast.

Hashing 8: Universal Hashing

Let x and y be distinct keys, and let H be a set of hash functions. H is called universal if the number of functions h in H for which h(x) = h(y) is precisely |H|/m. In other words, if we pick a random h in H, the probability that x and y collide under h is 1/m.

Universal hashing can be useful in many situations. You are asked to come up with a hashing technique for your boss. Your boss tells you that after you are done, he will pick some keys to hash. If you get too many collisions, he will fire you. If you use universal hashing, the only way he can succeed is by getting lucky. Why?

Hashing 9: Collision Resolution: Chaining

With chaining, we set up an array of links, indexed by the hash values, to lists of items with the same hash value.

[Figure: a small hash table whose slots each point to a linked list of the keys that hash to that slot.]

Let n be the number of keys, and m the size of the hash table. Define the load factor α = n/m. Successful and unsuccessful searching both take time Θ(1 + α) on average, assuming simple uniform hashing. By simple uniform hashing we mean that a key has equal probability to hash into any of the m slots, independent of the other elements of the table.

Hashing 10: Collision Resolution: Open Addressing

With open addressing, all elements are stored in the hash table itself. This means several things:
- Less memory is used than with chaining, since we don't have to store pointers.
- The hash table has an absolute size limit of m. Thus, planning ahead is important when using open addressing.
- We must have a way to store multiple elements with the same hash value. Instead of a hash function, we use a probe sequence, that is, a sequence of hash values. We go through the sequence one by one until we find an empty position. For searching, we do the same thing, skipping values that do not match our key.

Hashing 11: Probe Sequences

We define our hash functions with an extra parameter, the probe number i. Thus our hash functions look like h(k, i). We compute h(k, 0), h(k, 1), ..., h(k, i) until h(k, i) is empty. The hash function h must be such that the probe sequence h(k, 0), ..., h(k, m-1) is a permutation of 0, 1, ..., m-1. Why?

Insertion and searching are fairly quick, assuming a good implementation. Deletion can be a problem because the probe sequence can be complex. We will discuss 2 ways of defining probe sequences: linear probing and double hashing.

Hashing 12: Linear Probing

Let g be a hash function. With linear probing, the probe sequence is computed by h(k, i) = (g(k) + i) mod m. When we insert an element, we start with the hash value and proceed slot by slot until we find an empty one. For searching, we start with the hash value and proceed slot by slot until we find the key we are looking for.

Example: let g(k) = k mod 13, and insert the following keys into a hash table with slots 0 through 12: 18, 41, 22, 44, 59, 32, 31, 73.

Problem: the values in the table tend to cluster.

Hashing 13: Double Hashing

With double hashing we use two hash functions. Let h1 and h2 be hash functions. We define our probe sequence by h(k, i) = (h1(k) + i * h2(k)) mod m. Often we pick m = 2^n for some n, and ensure that h2(k) is always odd. Double hashing tends to distribute keys more uniformly than linear probing.

Hashing 14: Double Hashing Example

Let h1(k) = k mod 13, let h2(k) = 1 + (k mod 8), and let the hash table have size 13. Then our probe sequence is defined by

    h(k, i) = (h1(k) + i * h2(k)) mod 13
            = (k mod 13 + i * (1 + (k mod 8))) mod 13

Insert the following keys into a table with slots 0 through 12: 18, 41, 22, 44, 59, 32, 31, 73.

Hashing 15: Open Addressing: Performance

Again, we set α = n/m. This is the average number of keys per array index. Note that α <= 1 for open addressing. We assume a good probe sequence has been used; that is, for any key, each permutation of 0, 1, ..., m-1 is equally likely as a probe sequence. Then:
- The average number of probes for insertion or unsuccessful search is at most 1/(1 - α).
- The average number of probes for a successful search is at most (1/α) ln(1/(1 - α)) + 1/α.

Hashing 16: Chaining or Open Addressing?

[Figure: Hashing Performance, Chaining Versus Open Addressing. Average number of probes plotted against the load factor α from 0 to 1. The chaining curves (around 1 + α) grow slowly, while the open-addressing curves, 1/(1 - α) for insertion/unsuccessful search and (1/α) ln(1/(1 - α)) + 1/α for successful search, blow up as α approaches 1.]