Chapter 2

Linear Time Selection

Definition 1 Given a sequence x_1, x_2, ..., x_n of numbers, the ith order statistic is the ith smallest element.

Maximum or Minimum

When i = 1 it is the minimum element, and when i = n it is the maximum element. We can find either in n − 1 comparisons by doing a linear search of the list, from left to right or from right to left. So n − 1 comparisons are sufficient. This many comparisons are also necessary. Think of the process of determining the maximum (minimum), for example, as a tournament in which each comparison is a match to determine the winner (loser). The maximum (minimum) element is the champion-winner (champion-loser). Since the champion does not lose (win) a match and everyone else must lose (win) at least one match, n − 1 comparisons are also necessary.

Cormen, Ex 9.1-1, page 185

Solution: To determine the second smallest element, or the 2nd order statistic, we first determine the champion-loser. We do this by constructing a (binary) tournament tree of height ⌈log n⌉ for n players. This is easily done by constructing a left tournament tree on ⌈n/2⌉ elements and a right tournament tree on the remaining ⌊n/2⌋ elements, continuing the construction recursively for both subtrees. Since this tree has n leaves, it has n − 1 internal nodes, and hence as many matches have been played. Now we determine the second smallest by finding the champion-loser from among the at most ⌈log n⌉ players who have lost a match directly to the champion-loser. Hence the second smallest can be found in at most n + ⌈log n⌉ − 2 comparisons. An interesting question is whether this many comparisons are necessary. We will discuss this later in the course.

Here is an interesting problem related to the linear search for the minimum (or maximum) element, discussed above.

Problem 1 Show that during this linear search the expected number of times the minimum (or maximum) value is reset is Θ(log n).
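The tournament argument can be sketched in Python. This is a sketch of the idea only (the function name is mine, and the pairing is done bottom-up rather than by the recursive left/right construction): play a knockout tournament to find the minimum, recording who lost directly to each surviving player; the second smallest must have lost directly to the champion, so it is the minimum of the champion's at most ⌈log n⌉ recorded victims.

```python
def two_smallest(a):
    # Returns (smallest, second smallest) of a list with n >= 2 elements.
    # Each "player" is a (value, losers) pair; losers collects the values
    # that lost a match directly to this player.
    players = [(x, []) for x in a]
    while len(players) > 1:
        nxt = []
        # Pair the survivors off; an odd player out advances for free.
        for i in range(0, len(players) - 1, 2):
            (u, ul), (v, vl) = players[i], players[i + 1]
            if u <= v:
                ul.append(v)          # v lost directly to u
                nxt.append((u, ul))
            else:
                vl.append(u)          # u lost directly to v
                nxt.append((v, vl))
        if len(players) % 2 == 1:
            nxt.append(players[-1])
        players = nxt
    champ, losers = players[0]
    # The second smallest is the best of the champion's direct victims.
    return champ, min(losers)
```

The tournament plays n − 1 matches, and the final `min` over the champion's victims costs at most ⌈log n⌉ − 1 further comparisons, matching the bound above.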
Solution: Let the elements be x_1, x_2, ..., x_n. Let the indicator random variable X_i = 1 if the maximum is reset on examining the element x_i, and 0 otherwise. Thus X = Σ_{i=1}^{n} X_i is the number of times the maximum is reset. By the linearity of the expectation operator, E[X] = Σ_{i=1}^{n} E[X_i] is the expected value of the random variable X, where E[X_i] = p_i, the probability that on examining x_i the maximum is reset. Suppose we have examined the elements x_1 through x_{i−1}. The next element x_i resets the maximum if it is the largest of the elements x_1 through x_i. Since all permutations of the first i values are equally likely, any one of the first i values can occur in the ith spot, and in particular the maximum, with probability 1/i. Thus E[X] = Σ_{i=1}^{n} 1/i, and hence E[X] = Θ(log n).

Maximum and Minimum

Using the above algorithm we can find both the maximum and the minimum in 2n − 2 comparisons. The following cleverer algorithm makes at most 3⌊n/2⌋ comparisons. If n is odd, we set the first element to be both the maximum and the minimum. The remaining elements we compare in pairs, using the smaller (larger) one to reset the minimum (maximum), if necessary. Thus 3 comparisons for each of the (n − 1)/2 pairs, for a total of 3⌊n/2⌋ comparisons. If n is even, we set the smaller of the first pair to be the minimum and the larger to be the maximum. We then proceed as in the odd case, for a total of 1 + 3(n/2 − 1) = 3n/2 − 2 comparisons. Hence at most 3⌊n/2⌋ comparisons are sufficient in all cases.

Another way to organize the same comparisons is to maintain two disjoint sets S_1 and S_2 of potential minima and potential maxima respectively. Initially, S_1 and S_2 are empty. We compare the elements in pairs, adding the smaller to S_1 and the larger to S_2. When we have exhausted all the elements or have just one left, we determine the minimum of S_1 and the maximum of S_2.
If there is no leftover element, these are respectively the minimum and maximum of the entire list; otherwise, we compare both with the leftover element to determine whether either needs to be reset. Let n = 2k. Then the number of comparisons made is 3k − 2 = 3n/2 − 2. If n = 2k + 1, the number of comparisons made is 3k = ⌈3n/2⌉ − 2.

This solves Ex 9.1-2 in Cormen partly. The other part is the following problem.

Problem 2 Show that ⌈3n/2⌉ − 2 comparisons are necessary.

Problem 3 It is possible to solve this problem by a divide-and-conquer approach. We divide the list into two nearly equal halves and apply the algorithm recursively to each half. Let max_l (max_r) and min_l (min_r) be the maximum and minimum elements of the left (right) half. Then max(max_l, max_r) is the maximum of the entire list and min(min_l, min_r) is the minimum. If T(n) is the number of comparisons made on a list of size n, then

T(n) = T(⌈n/2⌉) + T(⌊n/2⌋) + 2, n ≥ 3.

Determine a closed form expression for T(n), assuming that T(2) = 1 and T(1) = 0.

© Dr. Asish Mukhopadhyay
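The paired-comparisons scheme described above can be written out as a short Python sketch (the function is my addition, not from the notes; it also counts comparisons to illustrate the bounds just derived, charging 3 comparisons per pair as in the analysis):

```python
def min_and_max(a):
    # Returns (minimum, maximum, comparisons) for a non-empty list a,
    # using the pairing scheme: one comparison decides each pair, then
    # the smaller is tried against the running min, the larger against
    # the running max -- 3 comparisons per pair instead of 4.
    n, comps = len(a), 0
    if n % 2:                 # odd n: the first element seeds both
        lo = hi = a[0]
        start = 1
    else:                     # even n: the first pair seeds min and max
        lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
        start, comps = 2, 1
    for i in range(start, n, 2):
        x, y = a[i], a[i + 1]
        comps += 3            # pair comparison plus the two resets
        if x < y:
            lo, hi = min(lo, x), max(hi, y)
        else:
            lo, hi = min(lo, y), max(hi, x)
    return lo, hi, comps
```

For even n the counter comes out to 1 + 3(n/2 − 1) = 3n/2 − 2, and for odd n it comes out to 3(n − 1)/2 = 3⌊n/2⌋, matching the totals above.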
Randomized Selection

This is similar to randomized quicksort in the sense that we choose a random element with respect to which we partition the input array into a left subarray and a right subarray, locate the subarray that contains the ith order statistic, and continue recursively with this subarray.

RANDOMIZED-SELECT(A, p, r, i)
1. if p = r
2.    then return A[p]
3. q ← RANDOMIZED-PARTITION(A, p, r)
4. k ← q − p + 1
5. if i = k
6.    then return A[q]
7. elseif i < k
8.    then return RANDOMIZED-SELECT(A, p, q − 1, i)
9. else return RANDOMIZED-SELECT(A, q + 1, r, i − k)

Cormen, Ex 9.2-1

This is obvious since 1 ≤ k ≤ r − p + 1 always. When k = 1 (respectively, k = r − p + 1), we either return the pivot as the answer or continue with the right (respectively, left) subarray, which has non-zero length. Also note that the recursion bottoms out when the array is of size 1.

The analysis is interesting. Let T(n) be the time to find the ith order statistic of the input array A[1..n]. The time T(n) is a random variable, since any element of A can be chosen as the pivot element. The probability of any particular choice is 1/n (which is also the expected value, E[X_k], of the random variable defined below). We are interested in the expected (or mean) value of T(n). Let X_k be a random variable (also called an indicator random variable) defined such that:

X_k = 1 if A[1..q] has k elements,
X_k = 0 otherwise.

Then

T(n) ≤ Σ_{k=1}^{n} X_k · max(T(k − 1), T(n − k)) + Θ(n),   (2.1)

where we have assumed that the ith order statistic is in the larger subarray created by the partition. Applying the expectation operator E[·] to both sides of the above, we get, from the linearity of this operator and the independence of the random variables X_k and max(T(k − 1), T(n − k)), that:

E[T(n)] ≤ Σ_{k=1}^{n} E[X_k] · E[max(T(k − 1), T(n − k))] + Θ(n)
        = Σ_{k=1}^{n} Pr{pivot is the kth smallest of A[1..n]} · E[max(T(k − 1), T(n − k))] + Θ(n)
        = (1/n) Σ_{k=1}^{n} E[max(T(k − 1), T(n − k))] + Θ(n),

where we have made the simplifying assumption that the larger subarray contains the ith order statistic.
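A runnable sketch of RANDOMIZED-SELECT follows. This is my rendering, not code from the notes: it is iterative rather than recursive, and it spells out RANDOMIZED-PARTITION with a Lomuto-style partition around a randomly chosen pivot.

```python
import random

def randomized_select(a, i):
    # Returns the i-th order statistic (i = 1 is the minimum) of list a.
    a = list(a)                      # work on a copy
    p, r = 0, len(a) - 1
    while True:
        if p == r:
            return a[p]
        # RANDOMIZED-PARTITION: swap a random pivot to the end, then
        # partition a[p..r] around it (Lomuto scheme).
        w = random.randint(p, r)
        a[w], a[r] = a[r], a[w]
        pivot, q = a[r], p
        for j in range(p, r):
            if a[j] <= pivot:
                a[q], a[j] = a[j], a[q]
                q += 1
        a[q], a[r] = a[r], a[q]      # pivot lands at index q
        k = q - p + 1                # rank of the pivot within a[p..r]
        if i == k:
            return a[q]
        elif i < k:
            r = q - 1                # recurse into the left subarray
        else:
            p, i = q + 1, i - k      # recurse into the right subarray
```

For example, `randomized_select([3, 1, 4, 1, 5], 2)` returns 1, the second smallest element; the pivot is random but the returned value is deterministic.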
If we assume that T(n) is monotone increasing, then T(k − 1) > T(n − k) for k − 1 > n − k, that is, for k > (n + 1)/2; T(k − 1) < T(n − k) for k − 1 < n − k; and T(k − 1) = T(n − k) for k = (n + 1)/2, a middle value that occurs as a term only when n is odd. Thus, the terms in the sum being equal in pairs around this middle value,
E[T(n)] ≤ (2/n) Σ_{k=⌊n/2⌋}^{n−1} E[T(k)] + Θ(n)
        = (2/n) Σ_{k=1}^{n−1} E[T(k)] − (2/n) Σ_{k=1}^{⌊n/2⌋−1} E[T(k)] + Θ(n).

Assume that E[T(n)] ≤ cn for a suitable constant c > 0. Then from the above we have:

E[T(n)] ≤ (2/n) · c · n(n − 1)/2 − (2/n) · c · (⌊n/2⌋ − 1)⌊n/2⌋/2 + Θ(n)
        ≤ (2/n) · c · n(n − 1)/2 − (2/n) · c · (n/2 − 1)(n/2 − 2)/2 + Θ(n)
        ≤ c(n − 1) − c(n/2 − 3)/2 + Θ(n)
        = cn − c(n/2 − 1)/2 + Θ(n)
        ≤ cn, provided c(n/2 − 1)/2 ≥ an,

where we have replaced the Θ(n) term by an, for some constant a. A small rearrangement reduces the inequality to n(c/4 − a) − c/2 ≥ 0, so that E[T(n)] ≤ cn for n ≥ (c/2)/(c/4 − a). By choosing c > 4a, we can set n_0 = 2c/(c − 4a). Thus E[T(n)] = O(n), and therefore E[T(n)] = Θ(n), since T(n) has a trivial lower bound of Ω(n).

Deterministic Selection

A clever deterministic method is used to choose the pivot so that it lies in the middle band of the sorted order, as indicated in Fig. 2.1 below. The implication of this is that no matter whether the ith order statistic lies to the left of the pivot or to its right, we are sure to prune the size of the input set by approximately a quarter in each partitioning step. This is a beautiful example of the prune-and-search paradigm.

Figure 2.1: Where the pivot lies (approximately n/4 elements are smaller than the pivot and approximately n/4 elements are larger, so the pivot lies somewhere in the middle band).

This is how it is done. We make groups of 5 out of the n input elements. If n is not a multiple of 5, there is one residual group with up to 4 elements. Choosing the median element of each group gives us ⌈n/5⌉ elements. We find the median m of these ⌈n/5⌉ medians, using this same algorithm; setting m to be the pivot, we continue recursively with the subarray that contains the ith order statistic.

For the complexity analysis, let us establish a lower bound on the number of elements that are greater than the pivot m. Barring the group that contains m and the residual group, at least ⌈(1/2)⌈n/5⌉⌉ − 2 groups have 3 elements each greater than m. Thus at least 3(⌈(1/2)⌈n/5⌉⌉ − 2) ≥ 3n/10 − 6 elements are greater than the pivot. Similarly, we can establish that at least 3n/10 − 6 elements are smaller than the pivot.
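The 3n/10 − 6 bound can be checked empirically with a short script. This is entirely my addition; `random.sample` keeps the drawn values distinct, so the counting on either side of the pivot is clean.

```python
import random

def medians_of_fives(a):
    # Median of each group of 5 (the residual group may be smaller).
    return [sorted(a[j:j + 5])[len(a[j:j + 5]) // 2]
            for j in range(0, len(a), 5)]

def pivot_bound_holds(n):
    # Draw n distinct values, compute the median of the group medians,
    # and check that at least 3n/10 - 6 elements lie on each side of it.
    a = random.sample(range(10 * n), n)
    meds = medians_of_fives(a)
    m = sorted(meds)[(len(meds) - 1) // 2]   # (lower) median of medians
    smaller = sum(1 for x in a if x < m)
    greater = sum(1 for x in a if x > m)
    return smaller >= 3 * n / 10 - 6 and greater >= 3 * n / 10 - 6
```

By the counting argument above, `pivot_bound_holds` returns True for every input size, whatever the random draw.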
We note that 3n/10 − 6 ≥ n/4 for n ≥ 120 and, a fortiori, for n ≥ 140. This solves Ex 9.3-2, page 192. If T(n) is the worst-case complexity of this deterministic selection algorithm, then

T(n) ≤ T(7n/10 + 6) + T(⌈n/5⌉) + Θ(n),

since we continue recursively with a subarray of size at most 7n/10 + 6, and also call the algorithm recursively to determine the median of the group of ⌈n/5⌉ medians.
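A Python sketch of the groups-of-5 algorithm follows. The structure is mine (the base-case cutoff of 50 and the three-way partition are my choices, not from the notes); it returns the ith order statistic in worst-case linear time.

```python
def select(a, i):
    # Returns the i-th smallest element of a (i >= 1, 1-based rank).
    a = list(a)
    if len(a) <= 50:                      # base case: just sort
        return sorted(a)[i - 1]
    # Medians of groups of 5 (the residual group may have fewer elements).
    medians = [sorted(a[j:j + 5])[len(a[j:j + 5]) // 2]
               for j in range(0, len(a), 5)]
    # Median of medians, found by a recursive call to this same algorithm.
    m = select(medians, (len(medians) + 1) // 2)
    # Three-way partition around the pivot m (robust to duplicates).
    lo = [x for x in a if x < m]
    eq = [x for x in a if x == m]
    hi = [x for x in a if x > m]
    if i <= len(lo):
        return select(lo, i)
    if i <= len(lo) + len(eq):
        return m
    return select(hi, i - len(lo) - len(eq))
```

The two recursive calls operate on at most ⌈n/5⌉ and roughly 7n/10 elements respectively, so the running time obeys the recurrence above.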
We use the substitution method again to show that T(n) ≤ cn for n ≥ n_0, for suitable choices of c and n_0; for n < n_0, T(n) = O(1). We have

T(n) ≤ c(7n/10 + 6) + c⌈n/5⌉ + an
     ≤ c(7n/10 + 6) + c(n/5 + 1) + an
     = cn + c(−3n/10 + 6) + c(n/5 + 1) + an
     = cn + (−cn/10 + 7c + an)
     ≤ cn, provided −cn/10 + 7c + an ≤ 0, or c ≥ an/(n/10 − 7).

We can satisfy this last inequality by letting c ≥ 20a and 20a ≥ 10an/(n − 70). From the latter inequality it follows that n ≥ 140. Thus, choosing n_0 = 140 and c ≥ 20a, we satisfy T(n) ≤ cn.

Note that for the above analysis to go through we must have 7/10 + 1/5 = 9/10 < 1. If we choose to make groups of 3, then these fractions are 2/3 and 1/3, adding up to 1, in which case the analysis does not go through. Groups of 7 also work, since in this case the fractions are 5/7 and 1/7. This solves Exercise 9.3-1, page 192.

Knowing how to find an order statistic in O(n) worst-case time helps us do Quicksort in O(n log n) worst-case time, since this is the solution to the recurrence

T(n) = 2T(n/2) + O(n).

This solves Cormen, page 192, Ex 9.3-3.

The solution to Cormen, page 192, Ex. 9.3-4 is also easily obtained using this algorithm. At each step of the deterministic selection algorithm, the status of at least n/4 elements with respect to the ith order statistic is resolved, in the sense that these are known to be either greater than or smaller than the ith order statistic. Thus we successively resolve the status of at least (1/4)n, (1/4)(3n/4), (1/4)(3/4)(3n/4), ... elements. This gives us an upper bound of (1/4) · 1/(1 − 3/4) · n = n elements. Thus, repeating the step enough times until a subarray of constant size is left, we resolve the status of all elements except those in this subarray of constant size. Since we then determine the ith order statistic by brute force, the status of the remaining elements is simultaneously resolved.

Cormen, Ex 9.3-5, page 192 is also easily solved.
After each median-finding step, the ith order statistic is located in one of the two half-sized subarrays. Thus the complexity is that of the sum

cn + cn/2 + cn/4 + ··· ≤ 2cn.

Cormen, Order Statistics, page 192, 9.3-7

We first find the median of the n numbers. To determine the k numbers that are closest to the median, we find the absolute distances of the remaining numbers from the median, and find the kth smallest of these distances. The numbers whose distances are among these k smallest are the k closest to the median.

Cormen, Order Statistics, page 192, 9.3-8
Given two sorted arrays X[] and Y[], each of size n, find their median in O(log n) time.

This problem is interesting. We first compare the median x_{n/2} of X[] with the median y_{n/2} of Y[]. Suppose y_{n/2} > x_{n/2}. Then between the two arrays there are more than n elements that are less than y_{n/2}. Thus we can prune from consideration the n/2 elements of Y[] that are greater than y_{n/2}. Next, we compare the element x_{3n/4} of X[] with the element y_{n/4} of Y[]. Let x_{3n/4} > y_{n/4}. In this case, we prune the elements of X[] that are greater than x_{3n/4}, since between the two arrays we have more than n elements that are less than x_{3n/4}. Next, we compare y_{3n/8} with x_{5n/8}, and so on. Thus in O(log n) steps we prune n elements to obtain the median. We terminate when the intervals of indetermination are of size 1, at which point we determine the median by brute force.

COMBINED-MEDIAN(X[p..q], Y[r..s])   // q − p = s − r
1. if p = q and r = s
2.    then return min(X[p], Y[r])
3. h ← ⌊(q − p + 1)/2⌋
4. if X[p + h − 1] ≤ Y[r + h − 1]
5.    then return COMBINED-MEDIAN(X[p + h..q], Y[r..s − h])   // drop the low half of X, the high half of Y
6.    else return COMBINED-MEDIAN(X[p..q − h], Y[r + h..s])   // drop the low half of Y, the high half of X

Cormen, 192, 9.3-9

Suppose there is just one oilfield. The optimal solution is the one in which the east-west line goes through the oilfield. When there are two oilfields, the east-west line can have any y-value that lies between the two oilfields; the sum of the distances is always the distance between the oilfields. Thus, when there are n oilfields and n is odd, the best solution is obtained when the east-west line goes through the oil well with the median y-value. For n even, an optimal solution is obtained when the east-west line goes anywhere between the wells with the (n/2)th and the (n/2 + 1)th y-coordinate values.

© Dr. Asish Mukhopadhyay
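The two-array pruning for the median can be sketched iteratively in Python. This is my rendering (index bookkeeping and names are mine): it keeps equal-sized windows into the two arrays, compares the hth smallest surviving element of each, discards h elements from the bottom of one window and h from the top of the other, and returns the lower median (the nth smallest) of the 2n combined elements.

```python
def combined_median(x, y):
    # x and y are sorted lists of equal length n; returns the n-th
    # smallest (lower median) of the 2n combined elements in O(log n).
    p, q = 0, len(x) - 1          # surviving window of x
    r, s = 0, len(y) - 1          # surviving window of y (same size)
    while p < q:                  # the windows shrink together
        h = (q - p + 1) // 2
        if x[p + h - 1] <= y[r + h - 1]:
            p, s = p + h, s - h   # drop the low half of x, high half of y
        else:
            r, q = r + h, q - h   # drop the low half of y, high half of x
    return min(x[p], y[r])        # lower median of the two survivors
```

Each round halves the window size while preserving the combined lower median, so only O(log n) comparisons are made in all.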