van Emde Boas Data Structure

Lecture 15

Supplemental reading in CLRS: Chapter 20

Given a fixed integer $n$, what is the best way to represent a subset $S$ of $\{0, \ldots, n-1\}$ in memory, assuming that space is not a concern? The simplest way is to use an $n$-bit array $A$, setting $A[i] = 1$ if and only if $i \in S$, as we saw in Lecture 10. This solution gives $O(1)$ running time for insertions, deletions, and lookups (i.e., testing whether a given number $x$ is in $S$). [1]

What if we want our data structure to support more operations, though? Perhaps we want to be able not just to insert, delete and look up, but also to find the minimum and maximum elements of $S$. Or, given some element $x \in S$, we may want to find the successor and predecessor of $x$, which are the smallest element of $S$ that is greater than $x$ and the largest element of $S$ that is less than $x$, respectively. On a bit array, these operations would all take time $\Theta(n)$ in the worst case, as we might need to examine every entry of the array. The van Emde Boas (vEB) data structure is a clever alternative solution which outperforms a bit array for this purpose:

    Operation                  Bit array      van Emde Boas
    INSERT                     $O(1)$         $\Theta(\lg \lg n)$
    DELETE                     $O(1)$         $\Theta(\lg \lg n)$
    LOOKUP                     $O(1)$         $\Theta(\lg \lg n)$
    MAXIMUM, MINIMUM           $\Theta(n)$    $O(1)$
    SUCCESSOR, PREDECESSOR     $\Theta(n)$    $\Theta(\lg \lg n)$

We will not discuss the implementation of SUCCESSOR and PREDECESSOR; those will be left to recitation.

[1] This performance really is unbeatable, even for small $n$. Each operation on the bit array requires only a single memory access.

15.1 Analogy: The Two-Coconut Problem

The following riddle nicely illustrates the idea of the van Emde Boas data structure. I am somewhat embarrassed to give the riddle because it shows a complete misunderstanding of materials science and botany, but it is a standard example and I can't think of a better one.

Problem 15.1. There exists some unknown integer $k$ between 1 and 100 (or in general, between 1 and some large integer $n$) such that, whenever a coconut is dropped from a height of $k$ inches or more, the coconut will crack, and whenever a coconut is dropped from a height of less than $k$ inches, the coconut will be completely undamaged and it will be as if we had not dropped the coconut at all. (Thus, we could drop the coconut from a height of $k-1$ inches a million times and nothing would happen.) Our goal is to find $k$ with as few drops as possible, given a certain number of test coconuts which cannot be reused after they crack. If we have one coconut, then clearly we must first try 1 inch, then 2 inches, and so on. The riddle asks, what is the best way to proceed if we have two coconuts?

An approximate answer to the riddle is given by the following strategy: Divide the $n$-inch range into $b$ blocks of height $h$ each (we will choose $b$ and $h$ later; see Figure 15.1). Drop the first coconut from height $h$, then from height $2h$, and so on, until it cracks. Say the first coconut cracks at height $b_0 h$. Then drop the second coconut from heights $(b_0 - 1)h + 1$, $(b_0 - 1)h + 2$, and so on, until it cracks. This method requires at most $b + h - 1 = n/h + h - 1$ drops, which is minimized when $b = h = \sqrt{n}$.

Figure 15.1. One strategy for the two-coconut problem is to divide the integers $\{1, \ldots, n\}$ into blocks of size $h$, and then use the first coconut to figure out which block $k$ is in. Once the first coconut breaks, it takes at most $h - 1$ more drops to find the value of $k$.

Notice that, once the first coconut cracks, our problem becomes identical to the one-coconut version (except that instead of looking for a number between 1 and $n$, we are looking for a number between $(b_0 - 1)h + 1$ and $b_0 h$). Similarly, if we started with three coconuts instead of two, then it would be a good idea to divide the range $\{1, \ldots, n\}$ into equally-sized blocks and execute the solution to the two-coconut problem once the first coconut cracked.

Exercise 15.1. If we use the above strategy to solve the three-coconut problem, what size should we choose for the blocks? (Hint: it is not $\sqrt{n}$.)
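The two-coconut strategy can be simulated in a few lines. The following is only a sketch: the function name `two_coconut_search`, the inner `cracks` oracle, and the choice to cap the last block at $n$ are assumptions made for illustration, not part of the lecture.

```python
import math

def two_coconut_search(n, k, h=None):
    """Find the hidden threshold k in {1, ..., n} using two coconuts.

    Returns (guess, drops). The block height h defaults to about sqrt(n).
    """
    if h is None:
        h = max(1, math.isqrt(n))

    def cracks(height):
        # Hypothetical oracle: a coconut cracks iff dropped from height >= k.
        return height >= k

    drops = 0
    top = 0
    # First coconut: try heights h, 2h, ... (capped at n) until it cracks.
    while True:
        low = top
        top = min(top + h, n)
        drops += 1
        if cracks(top):
            break
    # Second coconut: scan the block one inch at a time, from low + 1 to top - 1.
    for height in range(low + 1, top):
        drops += 1
        if cracks(height):
            return height, drops
    # No crack inside the block, so the threshold is the top of the block.
    return top, drops
```

For $n = 100$ the default block height is 10, and the worst case over all $k$ is 19 drops, which matches the bound $b + h - 1$ with $b = h = 10$.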
15.2 Implementation: A Recursive Data Structure

Just as in the two-coconut problem, the van Emde Boas data structure divides the range $\{0, \ldots, n-1\}$ into blocks of size $\sqrt{n}$, which we call clusters. Each cluster is itself a vEB structure of size $\sqrt{n}$. In addition, there is a summary structure that keeps track of which clusters are nonempty (see Figure 15.2). The summary structure is analogous to the first coconut, which told us what block $k$ had to lie in.

    vEB: size = 16, max = 13, min = 2
      summary  = vEB of size 4 representing {0, 1, 3}
      clusters = vEB of size 4 representing {2, 3}
                 vEB of size 4 representing {1, 3}
                 NIL
                 vEB of size 4 representing {1}

Figure 15.2. A vEB structure of size 16 representing the set $\{2, 3, 5, 7, 13\} \subseteq \{0, \ldots, 15\}$.

15.2.1 Crucial implementation detail

The following implementation detail, which may seem unimportant at first, is actually crucial to the performance of the van Emde Boas data structure:

    Do not store the minimum and maximum elements in clusters. Instead, store them as separate data fields.

Thus, the data fields of a vEB structure V of size $n$ are as follows:

• V.size: the size of V, namely $n$.
• V.max: the maximum element of V, or NIL if V is empty.
• V.min: the minimum element of V, or NIL if V is empty.
• V.clusters: an array of size $\sqrt{n}$ which stores the clusters. For performance reasons, the value stored in each entry of V.clusters will initially be NIL; we will wait to build each cluster until we have to insert something into it.
• V.summary: a van Emde Boas structure of size $\sqrt{n}$ that keeps track of which clusters are nonempty. As with the entries of V.clusters, we initially set V.summary ← NIL; we do not build the recursive van Emde Boas structure referenced by V.summary until we have to insert something into it (i.e., until the first time we create a cluster for V).

15.2.2 Insertions

To simplify the exposition, in this lecture we use a model of computation in which it takes constant time to initialize any array (setting all entries equal to NIL), no matter how big the array is. [2] Thus, it takes constant time to create an empty van Emde Boas structure, and it also takes constant time to insert the first element into a van Emde Boas structure:

    Algorithm: VEB-FIRST-INSERTION(V, x)
        V.min ← x
        V.max ← x

Using this fact, we will show that the procedure V.INSERT(x) has only one non-constant-time step. Say V.clusters[i] is the cluster corresponding to x (for example, if $n = 100$ and $x = 64$, then $i = 6$). Then:

• If V.clusters[i] is NIL, then it takes constant time to create a vEB structure containing only the element corresponding to x. (For example, if $n = 100$ and $x = 64$, then it takes constant time to create a vEB structure of size 10 containing only 4.) We update V.clusters[i] to point to this new vEB structure. Thus, the only non-constant-time operation we have to perform is to update V.summary to reflect the fact that V.clusters[i] is now nonempty.
• If V.clusters[i] is empty [3] (we can check this by checking whether V.clusters[i].min = NIL), then it takes constant time to insert the appropriate entry into V.clusters[i]. So again, the only non-constant-time operation we have to perform is to update V.summary to reflect the fact that V.clusters[i] is now nonempty.
• If V.clusters[i] is nonempty, then we have to make the recursive call V.clusters[i].INSERT(x), where x is understood as the corresponding element of the cluster. However, we do not need to make any changes to V.summary.

In each case, we find that the running time $T$ of INSERT satisfies the recurrence

    $T(n) = T(\sqrt{n}) + O(1)$.    (15.1)

As we will see in §15.3 below, the solution to this recurrence is $T_{\mathrm{INSERT}} = \Theta(\lg \lg n)$.

[2] Of course, this is cheating; real computers need an initialization time that depends on the size of the array. Still, there are use cases for which this assumption is warranted. We can preload our vEB structure by creating all possible vEB substructures up front rather than using NIL for the empty ones. This way, after an initial $O(n)$ preloading time, array initialization never becomes an issue. Another possibility is to use a dynamically resized hash table for V.clusters rather than an array. Each of these possible improvements has its share of extra details that must be addressed in the running time analysis; we have chosen to ignore initialization time for the sake of brevity and readability.

[3] This would happen if V.clusters[i] was once nonempty, but became empty due to deletions.

15.2.3 Deletions

Similarly, the procedure V.DELETE(x) requires only one recursive call. To see this, let i be as above. Then:

• First suppose V has no nonempty clusters (we can check this by checking whether V.summary is either NIL or empty; the latter happens when V.summary.min = NIL). If x is the only element of V, then we simply set V.min ← V.max ← NIL. Thus, deleting the only element of a single-element vEB structure takes constant time; we will use this fact later. Otherwise, V contains only two elements including x. Let y be the other element of V. If x = V.min, then y = V.max and we set V.min ← y. If x = V.max, then y = V.min and we set V.max ← y.

• Next suppose V has some nonempty clusters. If x = V.min, then we will need to update V.min. The new minimum takes only constant time to calculate, though: it is y, where y ← V.clusters[V.summary.min].min; in other words, it's the smallest element in the lowest nonempty cluster of V. After making this update, we need to make a recursive call to V.clusters[V.summary.min].DELETE(y) to remove the new value of V.min from its cluster. At this point, there are two possibilities:
    * y is not the only element in its cluster. Then V.summary does not need to be updated.
    * y is the only element in its cluster. Then we must make a recursive call to V.summary.DELETE(V.summary.min) to reflect the fact that y's cluster is now empty. However, since y was the only element in its cluster, it took constant time to remove y from its cluster: as we said above, it takes only constant time to remove the last element from a one-element vEB structure.
  Either way, there is only one non-constant-time step in V.DELETE: a recursive call to DELETE on a vEB structure of size $\sqrt{n}$.

  By entirely the same argument (perhaps in mirror image), we find that the case x = V.max has identical running time to the case x = V.min.

  If x is neither V.min nor V.max, then we must delete x from its cluster and, if this causes x's cluster to become empty, make a recursive call to V.summary.DELETE to reflect this update. As above, the recursive call to V.summary.DELETE will only happen when the deletion of x from its cluster took constant time.

In each case, the only non-constant-time step in DELETE is a single recursive call to DELETE on a vEB structure of size $\sqrt{n}$. Thus, the running time $T$ for DELETE satisfies (15.1), which we repeat here for convenience:

    $T(n) = T(\sqrt{n}) + O(1)$.    (copy of 15.1)

Again, the solution is $T_{\mathrm{DELETE}} = \Theta(\lg \lg n)$.

15.2.4 Lookups

Finally, we consider the operation V.LOOKUP(x), which returns TRUE or FALSE according as x is or is not in V. The implementation is easy: First we check whether x = V.min, then whether x = V.max. If neither of these is true, then we recursively call LOOKUP on the cluster corresponding to x. Thus, the running time $T$ of LOOKUP satisfies (15.1), and we have $T_{\mathrm{LOOKUP}} = \Theta(\lg \lg n)$.

Exercise 15.2. Go through this section again, circling each step or claim that relies on the decision not to store V.min and V.max in clusters. Be careful: there may be more things to circle than you think!
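The insertion, deletion, and lookup procedures of this section can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not in the lecture: the class name `VEB` and its method names are made up; `n` is taken to be of the form $2^{2^k}$ so that $\sqrt{n}$ is always an integer; a `dict` stands in for the NIL-initialized array (the hash-table alternative mentioned in the footnote on initialization); a structure of size 2 is represented by its `min` and `max` fields alone rather than by a two-bit array; and the swap that keeps the minimum and maximum out of the clusters during insertion is one reading of the crucial implementation detail, which the text does not spell out.

```python
import math

class VEB:
    """Illustrative vEB set over {0, ..., size-1}; size should be 2, 4, 16, 256, ..."""

    def __init__(self, size):
        self.size = size
        self.min = None                  # None plays the role of NIL
        self.max = None
        self.root = math.isqrt(size)     # cluster size sqrt(n); also the number of clusters
        self.clusters = {}               # assumption: dict in place of a NIL-filled array
        self.summary = None              # built lazily, like the clusters

    def insert(self, x):
        if self.min is None:             # VEB-FIRST-INSERTION
            self.min = self.max = x
            return
        if x == self.min or x == self.max:
            return                       # already present
        if self.min == self.max:         # one element so far; clusters stay untouched
            if x < self.min:
                self.min = x
            else:
                self.max = x
            return
        # Assumed detail: keep min and max out of the clusters by swapping with x.
        if x < self.min:
            x, self.min = self.min, x
        elif x > self.max:
            x, self.max = self.max, x
        i, low = divmod(x, self.root)    # cluster index and offset within the cluster
        cluster = self.clusters.get(i)
        if cluster is None:
            cluster = self.clusters[i] = VEB(self.root)
        was_empty = cluster.min is None
        cluster.insert(low)              # constant time if the cluster was empty
        if was_empty:
            if self.summary is None:
                self.summary = VEB(self.root)
            self.summary.insert(i)       # the only non-constant-time step in this case

    def delete(self, x):
        if self.summary is None or self.summary.min is None:   # no nonempty clusters
            if x == self.min == self.max:
                self.min = self.max = None
            elif x == self.min:
                self.min = self.max
            elif x == self.max:
                self.max = self.min
            return
        if x == self.min:                # pull the new minimum out of the lowest cluster
            i = self.summary.min
            low = self.clusters[i].min
            self.min = i * self.root + low
        elif x == self.max:              # mirror image of the previous case
            i = self.summary.max
            low = self.clusters[i].max
            self.max = i * self.root + low
        else:
            i, low = divmod(x, self.root)
        cluster = self.clusters.get(i)
        if cluster is None or cluster.min is None:
            return                       # x is not in the set
        cluster.delete(low)
        if cluster.min is None:          # that deletion took constant time
            self.summary.delete(i)

    def lookup(self, x):
        if x == self.min or x == self.max:
            return True
        i, low = divmod(x, self.root)
        cluster = self.clusters.get(i)
        return cluster is not None and cluster.lookup(low)
```

For example, after creating `v = VEB(16)` and inserting 2, 3, 5, 7, and 13, the fields `v.min` and `v.max` are 2 and 13, and `v.lookup(7)` returns `True`. Each method makes at most one recursive call that does more than constant work, which is the property the recurrence (15.1) captures.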

15.3 Solving the Recurrence

As promised, we now solve the recurrence (15.1), which we repeat here for convenience:

    $T(n) = T(\sqrt{n}) + O(1)$.    (copy of 15.1)

15.3.1 Base case

Before we begin, it's important to note that we have to lay down a base case at which recursive structures stop occurring. Mathematically this is necessary because one often uses induction to prove solutions to recurrences. From an implementation standpoint, the need for a base case is obvious: how could a vEB structure of size 2 make good use of smaller vEB substructures? So we will lay down $n = 2$ as our base case, in which we simply take V to be an array of two bits.

15.3.2 Solving by descent

The equation (15.1) means that we start on a structure of size $n$, then pass to a structure of size $\sqrt{n} = n^{1/2}$, then to a structure of size $\sqrt{\sqrt{n}} = n^{1/4}$, and so on, spending a constant amount of time at each level of recursion. So the total running time should be proportional to the number of levels of recursion before arriving at our base case, which is the number $\ell$ such that $n^{1/2^{\ell}} = 2$. Solving for $\ell$, we find $\ell = \lg \lg n$. Thus $T(n) = \Theta(\lg \lg n)$.

15.3.3 Solving by substitution

Another way to solve the recurrence is to make a substitution which reduces it to a recurrence that we already know how to solve. Let $T'(m) = T(2^m)$. Taking $m = \lg n$, (15.1) can be rewritten as

    $T'(m) = T'(m/2) + O(1)$,

which we know to have solution $T'(m) = \Theta(\lg m)$. Substituting back $n = 2^m$, we get $T(n) = \Theta(\lg \lg n)$.
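The descent argument can be checked numerically. The sketch below counts how many times we pass from size $n$ to size $\sqrt{n}$ before reaching the base case $n = 2$; the function name `recursion_levels` is made up for illustration, and $n$ is assumed to be of the form $2^{2^k}$ so that every square root is exact.

```python
import math

def recursion_levels(n):
    """Count the square-root steps needed to go from size n down to the base case 2."""
    levels = 0
    while n > 2:
        n = math.isqrt(n)    # pass from a structure of size n to one of size sqrt(n)
        levels += 1
    return levels

# Compare the level count with lg lg n for n = 2, 4, 16, 256, 65536, 2**32.
for k in range(6):
    n = 2 ** (2 ** k)
    print(n, recursion_levels(n), math.log2(math.log2(n)))
```

For $n = 2^{32}$, for instance, the sizes visited are $2^{32}, 2^{16}, 2^{8}, 2^{4}, 2^{2}, 2$, giving 5 levels, which equals $\lg \lg n$.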

MIT OpenCourseWare
http://ocw.mit.edu

6.046J / 18.410J Design and Analysis of Algorithms
Spring 2012

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.