Lecture 2 February 12, 2003




6.897: Advanced Data Structures (Spring 2003)
Prof. Erik Demaine
Lecture 2: February 12, 2003
Scribe: Jeff Lindy

1 Overview

In the last lecture we considered the successor problem for a bounded universe of size u. We began looking at the van Emde Boas [3] data structure, which implements Insert, Delete, Successor, and Predecessor in O(lg lg u) time per operation. In this lecture we finish up van Emde Boas, and improve the space complexity from our original O(u) to O(n). We also look at perfect hashing (first static, then dynamic), using it to improve the space complexity of van Emde Boas and to implement a simpler data structure with the same running time, y-fast trees.

2 van Emde Boas

2.1 Pseudocode for vEB operations

We start with pseudocode for Insert, Delete, and Successor. (Predecessor is symmetric to Successor.)

Insert(x, S)
    if x < min[S]: swap x and min[S]
    if min[sub[S][high(x)]] = nil:        // was empty
        Insert(high(x), summary[S])
        min[sub[S][high(x)]] <- low(x)
    else:
        Insert(low(x), sub[S][high(x)])
    if x > max[S]: max[S] <- x

Delete(x, S)
    if min[S] = nil or x < min[S]: return
    if min[S] = x:
        i <- min[summary[S]]

        x <- i * sqrt(u) + min[sub[S][i]]    // new min
        min[S] <- x
    Delete(low(x), sub[S][high(x)])
    if min[sub[S][high(x)]] = nil:        // now empty
        Delete(high(x), summary[S])       // in this case, the first recursive call was cheap

Successor(x, S)
    if x < min[S]: return min[S]
    if low(x) < max[sub[S][high(x)]]:
        return high(x) * sqrt(u) + Successor(low(x), sub[S][high(x)])
    else:
        i <- Successor(high(x), summary[S])
        return i * sqrt(u) + min[sub[S][i]]

2.2 Tree view of van Emde Boas

The van Emde Boas data structure can be viewed as a tree of trees. The upper and lower halves of the tree are each of height (lg u)/2; that is, we cut the tree in half by level. The upper tree has sqrt(u) leaves, as does each of the subtrees hanging off its leaves. (These subtrees correspond to the sub[S] data structures in Section 2.1.)

Figure 1: The summary bits sit atop sqrt(u) lower trees of sqrt(u) leaves each. In this case, we have the set {1, 9, 10, 15}.
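The high(x)/low(x) split used throughout this pseudocode can be sketched in Python. This is a hypothetical helper, not from the notes; it assumes u is a power of 4, so that sqrt(u) is an integer power of 2:

```python
import math

def high(x, u):
    # Index of the sqrt(u)-sized lower tree containing x (the high half of x's bits).
    return x // math.isqrt(u)

def low(x, u):
    # Position of x within its lower tree (the low half of x's bits).
    return x % math.isqrt(u)

def index(h, l, u):
    # Recombine the halves: the inverse map h * sqrt(u) + l used by Successor/Delete.
    return h * math.isqrt(u) + l
```

For u = 16 and x = 9 (binary 1001), high(9, 16) = 2 (binary 10) and low(9, 16) = 1 (binary 01), matching the worked example of querying for 9.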

We can consider our set as a bit vector (of length 16 in Figure 1) at the leaves, and mark all ancestors of present leaves. The function high(x) tells you how to walk through the upper tree; in the same way, low(x) tells you how to walk through the appropriate lower tree to reach x. For instance, consider querying for 9. 9 = 1001 in binary, which is equivalent to "right left left right"; high(9) = 10 ("right left"), low(9) = 01 ("left right").

2.3 Reducing Space in vEB

It is possible that this tree is wasting a lot of space. Suppose that one of the subtrees contains nothing whatsoever. Instead of storing all substructures/lower trees explicitly, we can store sub[S] as a dynamic hash table and store only nonempty substructures.

We perform an amortized analysis of space usage:

1. Hash: Charge the space of the hash table to the nonempty substructures, each of which stores an element as its min.
2. Summary: Charge elements in the summary structure to the mins of the corresponding substructures.

When a substructure has just one element, stored in the min field, it does not store a hash table, so it occupies just O(1) space. In all, the tree as described above uses O(n) space, whereas van Emde Boas as originally considered [3] takes more space, O(u).

For this approach to work, though, we need the following hash table black box:

BLACKBOX for hash table:
- Insert in O(1) time, amortized and expected
- Delete in O(1) time, amortized and expected
- Search in O(1) time, worst case
- Space O(n)

3 Transdichotomous Model

There are some issues with using the comparison model (too restrictive) and the standard unit-cost RAM (too permissive, in that we can fit arbitrarily large data structures into single words). Between the two we have the transdichotomous model. This machine is a modified RAM, also called a word RAM, with some fixed word size upon which you can perform constant-time arithmetic. We want to bridge problem size and machine model, so we need a word that is big enough (at least lg n bits) to index everything we are talking about. (This is a reasonable model since, say, a 64-bit word can index a truly enormous number of integers, 2^64.) In this model, we can manipulate a constant number of words in constant time.

4 Perfect Hashing

Perfect hashing is a way to implement hash tables such that the expected number of collisions is minimized. We do this by building a hash table of hash tables, where the choice of hash functions for the second level can be redone to avoid clobbering elements through hash collision. We'll start off with the static case: we already have all the elements, and won't be inserting new ones.

Figure 2: We're storing n elements from our universe U. h maps each element of U into the first-level hash table, which is of size n. Each entry in the first-level hash table has a count m_i of how many elements hash there, as well as a second hash function h_i which takes elements to a slot in the appropriate second-level hash table (of size m_i^2).

A set H of hash functions of the form h : U -> {0, 1, ..., m-1} is called universal if

    |{h in H : h(x) = h(y)}| = |H| / m    for all x, y in U, x != y.

That is, for an h chosen at random from H, the probability that two distinct keys collide is 1/m.

Universal sets of hash functions exist. For instance, take key k, and write it in base m:

    k = <k_0, k_1, ..., k_r>

Choose some other vector of this size at random:

    a = <a_0, a_1, ..., a_r>

Then the hash function

    h_a(k) = ( sum_{i=0}^{r} a_i k_i ) mod m
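This dot-product family can be sketched in Python. This is a minimal illustration, not from the notes; it assumes m is prime (which the universality argument needs) and that keys fit in r+1 base-m digits:

```python
import random

def make_hash(m, r, rng=random):
    # Draw a random weight vector a = (a_0, ..., a_r); the returned function
    # computes h_a(k) = (sum of a_i * k_i) mod m, where k_i are k's base-m digits.
    a = [rng.randrange(m) for _ in range(r + 1)]
    def h_a(k):
        total = 0
        for i in range(r + 1):
            total += a[i] * (k % m)   # k % m is the digit k_i
            k //= m
        return total % m
    return h_a

h = make_hash(m=13, r=2)   # handles keys 0 <= k < 13**3
```

Fixing m and r and letting a vary gives the family; each choice of a is one hash function from it.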

is essentially weighting each k_i via a dot product. Letting the vector a vary, we have a family of hash functions, and this family turns out to be universal.

Suppose m_i elements hash to bucket i (a slot in the first hash table). The expected number of collisions in a bucket is

    E[# of collisions in bucket i] = sum_{x != y} Pr{x and y collide}.

Because the second-level table is of size m_i^2, this sum is equal to

    (m_i choose 2) * (1 / m_i^2) < 1/2,

so by Markov's inequality the probability that bucket i has any collision is less than 1/2.

The expected total size of all second-level tables is

    E[space] = E[ sum_{i=0}^{n-1} m_i^2 ].

Because the total number of colliding pairs at the first level can be computed by counting pairs of elements in all second-level tables, this sum is equal to

    n + E[ sum_{i=0}^{n-1} 2 * (m_i choose 2) ] = n + 2 * (n choose 2) * (1/n) < 2n.

So our expected space is about 2n. More specifically, Pr[space > 4n] < 1/2. If we clobber an element, we pick some other hash function for that bucket. We will need to choose new hash functions only an expected constant number of times per bucket, because we have a constant probability of success each time.

4.1 Dynamic Perfect Hashing

Making a perfect hash table dynamic is relatively straightforward, and can be done by keeping track of how many elements we have inserted or deleted, and rebuilding (shrinking or growing) the entire hash table. This result is due to Dietzfelbinger et al. [4].

Suppose at time t there are n elements. Construct a static perfect hash table for 2n elements. If we insert another n elements, we rebuild again, this time a static perfect hash table for 4n elements. As before, whenever there's a collision at a second-level hash table, just rebuild that second-level table with a new choice of hash function from our family H. This won't happen very often, so the amortized cost is low. If we delete elements until we have n/4, we rebuild for n/2 instead of 2n. (If we delete enough, we rebuild at half the size.) All operations take amortized O(1) time.
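The static two-level construction can be sketched in Python: n first-level buckets, a size-m_i^2 second-level table per bucket, and redrawing a bucket's hash function until it is collision-free. The multiply-mod-prime functions stand in for the universal family; the prime p and the retry loop are illustrative assumptions:

```python
import random

def build_perfect_table(keys, seed=0):
    rng = random.Random(seed)
    p = 10**9 + 7                     # assumed prime larger than any key
    n = len(keys)

    def draw(size):
        # A simple multiply-mod-prime hash into `size` slots.
        a = rng.randrange(1, p)
        return lambda k, a=a: (a * k % p) % size

    h1 = draw(n)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)

    second = []
    for b in buckets:
        size = len(b) ** 2            # quadratic space makes collisions unlikely
        while True:                   # expected O(1) redraws per bucket
            h2 = draw(size) if size else None
            slots = [None] * size
            ok = True
            for k in b:
                j = h2(k)
                if slots[j] is not None:
                    ok = False        # second-level collision: redraw
                    break
                slots[j] = k
            if ok:
                break
        second.append((h2, slots))

    def lookup(k):
        h2, slots = second[h1(k)]
        return h2 is not None and slots[h2(k)] == k

    return lookup

lookup = build_perfect_table([3, 17, 42, 99, 123])
```

Lookups touch exactly two tables, so they run in worst-case O(1) time once the structure is built.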

5 y-fast Trees

5.1 Operation Costs

y-fast trees [1, 2] accomplish the same running times as van Emde Boas, O(lg lg u) for Insert, Delete, Successor, and Predecessor, but they are simple once you have dynamic perfect hashing in place.

We can define correspondences: an element of U, its lg u-bit string, and a root-to-leaf path in a balanced binary search tree. For all x in our set, we will store a dynamic perfect hash table of all prefixes of these bit strings. Unfortunately, this structure requires O(n lg u) space, because there are n elements, each of which has lg u prefixes. Also, insertions and deletions cost O(lg u). We'll fix these high costs later.

What does this structure do for Successor, though? To find a successor, we binary search for the lowest filled node on the path to x. We ask "Is half the path there?" for successively smaller halves. It takes O(lg lg u) time to find this node.

Suppose we have each node store the min and max element of its subtree (in the hash table entry). This change adds only a constant factor in space. Also, we can maintain a linked list of the entire set of occupied leaves. Then we can find the successor and the predecessor of x in O(1) time given the lowest filled node.

Using this structure, we can perform Successor/Predecessor in O(lg lg u) time, and Insert/Delete in O(lg u) time, using O(n lg u) space.

5.2 Indirection

We cluster the n present elements into consecutive groups of size Theta(lg u). Within a group, we use a balanced binary search tree to do Insert/Delete/Successor/Predecessor in O(lg lg u) time.

- We use a hash table structure as above to store one representative element per group, giving O(n) space.
- If a search in the hash table narrows down to two groups, look in both.
- Insert and Delete are buffered by the groups, so updating the hash table works out to O(1) amortized, and each operation takes O(lg lg u) total.
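The binary search over prefix lengths in Section 5.1 can be sketched in Python. A plain set stands in for the dynamic perfect hash table, and prefixes are keyed as (length, bits) pairs; both choices are illustrative, not from the notes:

```python
def prefixes_of(x, w):
    # All prefixes of x's w-bit string, from the empty prefix (root) down to x.
    return {(i, x >> (w - i)) for i in range(w + 1)}

def longest_present_prefix(x, w, table):
    # Binary search for the lowest "filled" node on the path to x:
    # the longest prefix of x present in `table`.  O(lg w) lookups.
    lo, hi = 0, w              # invariant: the length-lo prefix is present
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if (mid, x >> (w - mid)) in table:
            lo = mid           # upper half of the path is there; go deeper
        else:
            hi = mid - 1
    return lo

# The set {1, 9, 10, 15} from Figure 1, with u = 16, i.e. w = lg u = 4:
table = set()
for v in (1, 9, 10, 15):
    table |= prefixes_of(v, 4)
```

For x = 8 (binary 1000), the search stops at prefix length 3 (the node 100); that node's stored min for its subtree, here 9, then yields the successor in O(1) time.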

Figure 3: Using indirection in y-fast trees: the n elements are split into groups of Theta(lg u) consecutive elements, with one representative per group stored in the vEB-style structure.

References

[1] Dan E. Willard, Log-logarithmic worst-case range queries are possible in space Theta(N), Information Processing Letters, 17(2):81-84, 1983.

[2] Dan E. Willard, New trie data structures which support very fast search operations, Journal of Computer and System Sciences, 28(3):379-394, 1984.

[3] P. van Emde Boas, Preserving order in a forest in less than logarithmic time and linear space, Information Processing Letters, 6(3):80-82, 1977.

[4] M. Dietzfelbinger, A. Karlin, K. Mehlhorn, F. Meyer auf der Heide, H. Rohnert, and R. E. Tarjan, Dynamic perfect hashing: upper and lower bounds, SIAM Journal on Computing, 23(4):738-761, 1994. http://citeseer.nj.nec.com/dietzfelbinger90dynamic.html