# The Effectiveness of Lloyd-Type Methods for the k-means Problem

Save this PDF as:

Size: px
Start display at page:

## Transcription

2 data in its Voronoi region and is located at the center of mass of this data. Perhaps the most popular heuristic used for this problem is Lloyd s method, which consists of the following two phases: (a) Seed the process with some initial centers (the literature contains many competing suggestions of how to do this); (b) Iterate the following Lloyd step until the clustering is good enough : cluster all the data in the Voronoi region of a center together, and then move the center to the centroid of its cluster. Although Lloyd-style methods are widely used, to our knowledge, there is no known mathematical analysis that attempts to explain or predict the performance of these heuristics on high-dimensional data. In this paper, we take the first such step. We show that if the data is well-clusterable according to a certain clusterability or separation condition (that we introduce and justify), then various Lloydstyle methods do indeed perform well and return a provably near-optimal clustering. Our contributions are threefold: () We introduce a separation condition and justify it as a reasonable abstraction of well-clusterability for the analysis of k-means clustering algorithms. Our condition is simple, and abstracts a notion of well-clusterability alluded to earlier: letting k (X) denote the cost of an optimal k-means solution of input X, we say that X is ɛ-separated for k- means if k (X)/ k (X) ɛ. (A similar condition for k = was used for l edge-cost clustering in [43].) One motivation for proposing this condition is that a significant drop in the k-clustering cost is already used in practice as a diagnostic for choosing the value of k (see [4] 0.0). Furthermore, we show (in Section 5) that: (i) The data satisfies our separation condition if and only if it satisfies the other intuitive notion of well-clusterability suggested earlier, namely that any two low-cost k-clusterings disagree on only a small fraction of the data; and (ii) The condition is robust under noisy (even adversarial) perturbation of the data. () We present a novel and efficient sampling process for seeding Lloyd s method with initial centers, which allows us to prove the effectiveness of these methods. (3) We demonstrate the effectiveness of (our variants of) the Lloyd heuristic under the separation condition. Specifically: (i) Our simplest variant uses only the new seeding procedure, requires a single Lloyd-type descent step, and achieves a constant-factor approximation in time linear in X. This algorithm has success probability exponentially small in k, but we show that (ii) a slightly more complicated seeding process based on our sampling procedure yields a constant-factor approximation guarantee with constant probability, again in linear time. Since only one run of seeding+descent is required in both algorithms, these are candidates for being faster in practice than currently used Lloyd variants, which are used with multiple re-seedings and many Lloyd steps per re-seeding. (iii) We also give a PTAS by combining our seeding process with a sampling procedure of Kumar, Sabharwal and Sen [30], whose whose running time is linear in X and the dimension, and exponential in k. This PTAS is significantly faster, and also simpler, than the PTAS of Kumar et al. [30] (applying the separation condition to both algorithms; the latter does not run faster under the condition). Literature and problem formulation. Let X R d be the given point set and n = X. In the k-means problem, the objective is to partition X into k clusters X,..., X k and assign each point in every cluster Xi to a common center c i R d, so as to minimize the k-means cost k i= x X i x c i, where. denotes the l norm. We let k (X) denote the optimum k-means cost. Observe that given the centers c,..., c k, it is easy to determine the best clustering corresponding to these centers: cluster X i simply consists of all points x X for which c i is the nearest center (breaking ties arbitrarily). Conversely given a clustering X i,..., X k, the best centers corresponding to this clustering are obtained by setting c i to be the center of mass (centroid) of cluster X i, that is, setting c i = X x i X i x. It follows that both of these properties simultaneously hold in an optimal solution, that is, c i is the centroid of cluster X i, and each point in X i has c i as its nearest center. The problem of minimizing the k-means cost is one of the earliest and most intensively studied formulations of the clustering problem, both because of its mathematical elegance and because it bears closely on statistical estimation of mixture models of k point sources under spherically symmetric Gaussian noise. We briefly survey the most relevant literature here. The k-means problem seems to have been first considered by Steinhaus in 956 [46]. A simple greedy iteration to minimize cost was suggested in 957 by Lloyd [3] (and less methodically in the same year by Cox [9]; also apparently by psychologists between [47]). This and similar iterative descent methods soon became the dominant approaches to the problem [35, 33,, 3] (see also [9, 0, 4] and the references therein); they remain so today, and are still being improved [, 40, 4, 8]. Lloyd s method (in any variant) converges only to local optima however, and is sensitive to the choice of the initial centers [38]. Consequently, a lot of research has been directed toward seeding methods that try to start off Lloyd s method with a good initial configuration [8, 9, 7, 3, 44, 5, 36, 4]. Very few theoretical guarantees are known about Lloyd s method or its variants. The convergence rate of Lloyd s method has recently been investigated in [0,, ] and in particular, [] shows that Lloyd s method can require a superpolynomial number of iterations to converge. The k-means problem is NP-hard even for k = [3].

3 Recently there has been substantial progress in developing approximation algorithms for this problem. Matoušek [34] gave the first PTAS for this problem, with running time polynomial in n, for a fixed k and dimension. Subsequently a succession of algorithms have appeared [39, 4,, 5, 6,, 30] with varying runtime dependency on n, k and the dimension. The most recent of these is the algorithm of Kumar, Sabharwal and Sen [30], which presents a linear time PTAS for a fixed k. There are also various constantfactor approximation algorithms for the related k-median problem [6, 7, 6, 5, 37], which also yield approximation algorithms for k-means, and have running time polynomial in n, k and the dimension; recently Kanungo et al. [7] adapted the k-median algorithm of [3] to obtain a (9 + ɛ)- approximation algorithm for k-means. However, none of these methods match the simplicity and speed of the popular Lloyd s method. Researchers concerned with the runtime of Lloyd s method bemoan the need for n nearest-neighbor computations in each descent step [8]! Interestingly, the last reference provides a data structure that provably speeds up the nearest-neighbor calculations of Lloyd descent steps, under the condition that the optimal clusters are well-separated. (This is unrelated to providing performance guarantees for the outcome.) Their data structure may be used in any Lloyd-variant, including ours, and is well suited to the conditions under which we prove performance of our method; however, ironically, it may not be worthwhile to precompute their data structure since our method requires so few descent steps.. Preliminaries We use the following notation throughout. For a point set S, we use ctr(s) to denote the center of mass of S. Let partition X X k = X be an optimal k-means clustering of the input X, and let c i = ctr(x i ) and c = ctr(x). So k (X) = k i= x X i x c i = k i= (X i ). Let n i = X i, n = X, and ri = (Xi) n i, that is, ri is the mean squared error in cluster X i. Define D i = min j i c j c i. We assume throughout that X is ɛ-separated for k-means, that is, k (X) ɛ k (X), where 0 < ɛ ɛ 0 with ɛ 0 being a suitably small constant. We use the following basic lemmas quite frequently. Lemma. For every x, y X x y = (X) + n x c. Hence {x,y} X x y = n (X). Lemma. Let S R d, and S S be a partition of S with S. Let s = ctr(s), s = ctr(s ), and s = ctr(s ). Then, (i) (S) = (S ) + (S ) + S S S s s, and (ii) s s (S) S S S. Proof: Let a = S and b = S = S S. We have (S) = x S x c + x S x c = ( (S ) + a s s ) + ( (S ) + b s s ) = (S ) + (S ) + ab a+b s s. The last equality follows from Lemma. by noting that s is also the center of mass of the point set where a points are located at s and b points are located at s, and so the optimal -means cost of this point set is given by a s s + b s s. This proves part (i). Part (ii) follows by substituting s s = s s b/(a + b) in part (i) and dropping the (S ) and (S ) terms. 3. The -means problem We first consider the -means case. We assume that the input X is ɛ-separated for -means. We present an algorithm that returns a solution of cost at most ( + f(ɛ) ) (X) in linear time, for a suitably defined function f that satisfies lim ɛ 0 f(ɛ) = 0. An appealing feature of our algorithm is its simplicity, both in description and analysis. In Section 4, where we consider the k-means case, we will build upon this algorithm to obtain both a linear time constant-factor (of the form + f(ɛ)) approximation algorithm and a PTAS with running time exponential in k, but linear in n, d. The chief algorithmic novelty in our -means algorithm is a non-uniform sampling process to pick two seed centers. Our sampling process is very simple: we pick the pair x, y X with probability proportional to x y. This biases the distribution towards pairs that contribute a large amount to (X) (noting that n (X) = {x,y} X x y ). We emphasize that, as improving the seeding is the only way to get Lloyd s method to find a high-quality clustering, the topic of picking the initial seed centers has received much attention in the experimental literature (see, e.g., [4] and references therein). However, to the best of our knowledge, this simple and intuitive seeding method is new to the vast literature on the k-means problem. By putting more weight on pairs that contribute a lot to (X), the sampling process aims to pick the initial centers from the cores of the two optimal clusters. We define the core of a cluster precisely later, but loosely speaking, it consists of points in the cluster that are significantly closer to this cluster-center than to any other center. Lemmas 3. and 3. make the benefits of this approach precise. Thus, in essence, we are able to leverage the separation condition to nearly isolate the optimal centers. Once we have the initial centers within the cores of the two optimal clusters, we show that a simple Lloyd-like step, which is also simple to analyze, yields a good performance guarantee: we consider a

4 suitable ball around each center and move the center to the centroid of this ball to obtain the final centers. This ballk-means step is adopted from Effros and Schulman [6], where it is shown that if the k-means cost of the current solution is small compared to k (X) (which holds for us since the initial centers lie in the cluster-cores) then a Lloyd step followed by a ball-k-means step yields a clustering of cost close to k (X). In our case, we are able to eliminate the Lloyd step, and show that the ball-k-means step alone guarantees a good clustering.. Sampling. Randomly select a pair of points from the set X to serve as the initial centers, picking the pair x, y X with probability proportional to x y. Let ĉ, ĉ denote the two picked centers.. Ball-k-means step. For each ĉ i, consider the ball of radius ĉ ĉ /3 around ĉ i and compute the centroid c i of the portion of X in this ball. Return c, c as the final centers. Running time The entire algorithm runs in time O(nd). Step clearly takes only O(nd) time. We show that the sampling step can be implemented to run in O(nd) time. Consider the following two-step sampling procedure: (a) first pick center ĉ by choosing a point x X with probability equal to Py X x y Px,y X x y = ( (X) + n x c ) /n (X) (using Lemma.); (b) pick the second center by choosing point y X with probability equal to y ĉ / ( (X) + n c ĉ ). This two-step sampling procedure is equivalent to the sampling process in step, that is, it picks pair x, x X with probability x x. Each step takes only O(nd) time since P{x,y} X x y (X) can be precomputed in O(nd) time. Analysis The analysis hinges on the important fact that under the separation condition, the radius r i of each optimal cluster is substantially smaller than the inter-cluster separation c c (Lemma 3.). This allows us to show in Lemma 3. that with high probability, each initial center ĉ i lies in the core (suitably defined) of a distinct optimal cluster, say X i, and hence c c is much larger than the distances ĉ i c i for i =,. Assuming that ĉ, ĉ lie in the cores of the clusters, we prove in Lemma 3.3 that the ball around ĉ i contains only, and most of the mass of cluster X i, and therefore the centroid c i of this ball is very close to c i. This in turn implies that the cost of the clustering around c, c is small. Lemma 3. max(r, r ) ɛ c c. Proof: By part (i) of Lemma. we have (X) = (X) + nn n c c n which is equivalent to n n (X) = c c (X) (X) (X). r n n + r n n ɛ c c. This implies that Let = 00ɛ. We require that <. We define the core of cluster X i as the set Xi cor = { } x X i : x c i r i. By Markov s inequality, Xi cor ( )n i for i =,. Lemma 3. Pr [ĉ, ĉ lie in distinct cores] = O(). Proof: To simplify our expressions, we assume that the points are scaled by c. We have c (X) = (X) + nn n c c by part (i) of Lemma., so (X) nn n( ). Let c i = ctr(xcor i ). By part (ii) of Lemma. (with S = X i, S = Xi cor ), c i c i r i. The probability of the stated event is A/B where A = x X x y = X cor,y Xcor cor (X cor ) + X cor (X cor ) + X cor X cor c c, and B = {x,y} X x y = n (X) nn. By the bounds on c i c i and Lemma 3., we get c c ɛ ( )( ). So A/B = ( O() ). So we may assume that each initial center ĉ i lies in X cor i. Let ˆd = ĉ ĉ and B i = {x X : x ĉ i ˆd/3}. Recall that c i is the centroid of B i. Lemma 3.3 For each i, we have Xi cor B i X i. Hence, c i c i r i. Proof: By Lemma 3. and the definition of Xi cor, we have ɛ ĉ i c i θ c c for i =, where θ = ( ) 0. So 4 5 ˆd c 6 c 5. For any x B i we have x c i ˆd 3 + ĉ i c i c c, so x X i. For any x Xi cor, x ĉ i θ c c ˆd 3, so x B i. Now by part (ii) of Lemma., with S = X i and S = B i, we get c i c i r i since B i Xi cor. Theorem 3.4 The above algorithm returns a clustering of cost at most (X) with probability at least O() in time O(nd), where = Θ(ɛ ). Proof: The cost of the solution is at most i x X i x c i = ( i (X i ) + n i c i c i ) (X) 4. The k-means problem. We now consider the k-means setting. We assume that k (X) ɛ k (X). We describe a linear time constantfactor approximation algorithm, and a PTAS that returns a

5 ( + ω)-optimal solution in time O ( O(k/w) nd ). The algorithms consist of various ingredients, which we describe separately first for ease of understanding, before gluing them together to obtain the final algorithm. Conceptually both algorithms proceed in two stages. The first stage is a seeding stage, which performs the bulk of the work and guarantees that at the end of this stage there are k seed centers positioned at nearly the right locations. By this we mean that if we consider distances at the scale of the inter-cluster separation, then at the end of this stage, each optimal center has a (distinct) initial center located in close proximity this is precisely the leverage that we obtain from the k-means separation condition (as in the -means case). We employ three simple seeding procedures, with varying time vs. quality guarantees, that exploit this separation condition to seed the k centers at locations very close to the optimal centers. In Section 4.., we consider a natural generalization of the sampling procedure used for the -means case, and show that this picks the k initial centers from the cores of the optimal clusters. This sampling procedure runs in linear time but it succeeds with probability that is exponentially small in k. In Section 4.., we present a very simple deterministic greedy deletion procedure, where we start off with all points in X as the centers and then greedily delete points (and move centers) until there are k centers left. The running time here is O(n 3 d). Our deletion procedure is similar to the reverse greedy algorithm proposed by Chrobak, Kenyon and Young [8] for the k-median problem. Chrobak et al. show that their reverse greedy algorithm attains an approximation ratio of O(log n), which is tight up to a factor of log log n. In contrast, for the k- means problem, if k (X) ɛ k (X), we show that our greedy deletion procedure followed by a clean-up step (in the second stage) yields a ( + f(ɛ) ) -approximation algorithm.finally, in Section 4..3 we combine the sampling and deletion procedures to obtain an O(nkd + k 3 d)-time initialization procedure. We sample O(k) centers, which ensures that every cluster has an initial center in a slightly expanded version of the core, and then run the deletion procedure on an instance of size O(k) derived from the sampled points to obtain the k seed centers. Once the initial centers have been positioned sufficiently close to the optimal centers, we can proceed in two ways in the second-stage (Section 4.). One option is to use a ball-k-means step, as in -means, which yields a clustering of cost ( + f(ɛ) ) k (X) due to exactly the same reasons as in the -means case. Thus, combined with the initialization procedure of Section 4..3, this yields a constant-factor approximation algorithm with running time O(nkd + k 3 d). The entire algorithm is summarized in Section 4.3. The other option, which yields a PTAS, is to use a sampling idea of Kumar et al. [30]. For each initial center, we compute a list of candidate centers for the corresponding optimal cluster as follows: we sample a small set of points uniformly at random from a slightly expanded Voronoi region of the initial center, and consider the centroid of every subset of the sampled set of a certain size as a candidate. We exhaustively search for the k candidates (picking one candidate per initial center) that yield the least cost solution, and output these as our final centers. The fact that each optimal center c i has an initial center in close proximity allows us to argue that the entire optimal cluster X i is contained in the expanded Voronoi region of this initial center, and moreover that X i is a significant fraction of the total mass in this region. Given this property, as argued by Kumar et al. (Lemma.3 in [30]), a random sample from the expanded Voronoi region also (essentially) yields a random sample from X i, which allows us to compute a good estimate of the centroid of X i, and hence of (X i ). We obtain a ( + ω)-optimal solution in time O ( O(k/ω) nd ) with constant probability. Since we incur an exponential dependence on k anyway, we just use the simple sampling procedure of Section 4.. in the first-stage to pick the k initial centers. Although the running time is exponential in k, it is significantly better than the running time of O ( (k/ω)o() nd ) incurred by the algorithm of Kumar et al.; we also obtain a simpler PTAS. Both of these features can be traced to the separation condition, which enables us to nearly isolate the positions of the optimal centers in the first stage. Kumar et al. do not have any such facility, and therefore need to sequentially guess (i.e., exhaustively search) the various centroids, incurring a corresponding increase in the run time. This PTAS is described in Section Seeding procedures used in stage I 4.. Sampling. We pick k initial centers as follows: first pick two centers ĉ, ĉ as in the -means case, that is, choose x, y X with probability proportional to x y. Suppose we have already picked i centers ĉ,..., ĉ i. Now pick a random point x X with probability proportional to min j {,...,i} x ĉ j and set that as center ĉ i+. Running time The sampling procedure consists of k iterations, each of which takes O(nd) time. This is because after sampling a new point ĉ i+, we can update the quantity min j {,...,i+} x ĉ j for each point x in O(d) time. So the overall running time is O(nkd). Analysis Let ɛ < be a parameter that we will set later. As in the -means case, we define the core of cluster X i as Xi cor = { } x X i : x c i r i. We show that under our separation assumption, the above sampling procedure will pick the k initial centers to lie in the cores of the clusters X,..., X k with probability ( O() ) k. We also

6 show that if we sample more than k, but still O(k), points, then with constant probability, every cluster will contain a sampled point that lies in a somewhat larger core, that we call the outer core of the cluster. This analysis will be useful in Section Lemma 4. With probability O(), the first two centers ĉ, ĉ lie in the cores of different clusters, that is, Pr[ i j (x Xcor i and y Xj cor )] = O(). Proof: The key observation is that for any pair of distinct clusters X i, X j, the -means separation condition holds for X i X j, that is, (X i X j ) = (X i ) + (X j ) ɛ (X i X j ). So using Lemma 3. we obtain that x Xi cor,y X x y = ( ) j cor O() {x,y} X i X j x y. Summing over all pairs i, j yields the lemma. Now inductively suppose that the first i centers picked ĉ,..., ĉ i lie in the cores of clusters X j,..., X ji. We show that conditioned on this event, ĉ i+ l/ {j,...,j i} Xcor l with probability O(). Given a set S of points, we use d(x, S) to denote min y S x y. Lemma 4. Pr [ ĉ i+ lies in l/ {j ],...,j i} Xcor l lie in the cores of X j,..., X ji = O(). max x X cor j ĉ,..., ĉ i Proof: For convenience, we re-index the clusters so that {j,..., j i } = {,..., m}. Let Ĉ = {ĉ,..., ĉ i }. For any cluster X j, let p j {,..., i} be such that d(c j, Ĉ) = c j ĉ pj. Let A = k j=m+ x X d(x, j cor Ĉ), and B = k j= x X j d(x, Ĉ). Observe that the probability of the event stated in the lemma is exactly A/B. Let α denote the maximum over all j m + of the quantity x c j /d(c j, Ĉ). For any point x Xcor j, j m+, we have d(x, Ĉ) ( α)d(c j, Ĉ). By Lemma 3., α ɛ/ ( ) / ( ) enough. Therefore, A = k ɛ ( ) j=m+ < for a small x Xj cor d(x, Ĉ) k j=m+ ( )( α) n j d(c j, Ĉ). On the other hand, for any point x X j, we have d(x, Ĉ) x ĉ p j. For j =,..., m, ĉ pj lies in Xj cor, so c j ĉ pj rj. Therefore, B k j=,x X j x ĉ pj k ( j= (X j ) + n j c j ĉ pj ) ( ) + k (X) + k j=m+ n jd(c j, Ĉ). Finally, for any j = m +,... k, if we assign all the points in cluster X j to the point ĉ pj, then the increase in cost is exactly n j c j ĉ pj and at least k (X) k (X). Therefore ( ɛ ) k (X) n j d(c j, Ĉ) for any j = m +,..., k, and B +ɛ / k j=m+ n jd(c j, Ĉ). Comparing with A and plugging in the value of α, we get that A = ( O( + ɛ ) ) B. If we set = Ω(ɛ /3 ), we obtain A/B = O(). Next, we analyze the case when more than k points are sampled. Let = 3. Define the outer core of X i to be Xi out = {x X i : x c i r i }. Note that Xi out. Let N = k 5 + ln(/δ) ( 5) where 0 < δ < is a desired error tolerance. By mimicking the proof of X cor i Lemma 4., one can show (Lemma 4.3) that at every sampling step, there is a constant probability that the sampled point lies in the core of some cluster whose outer core does not contain a previously sampled point. Observe that while Lemma 4. only shows that the good event happens conditioned on the fact that previous samples were also good, Lemma 4.3 gives an unconditional bound. This crucial difference allows us to argue, via a straightforward martingale analysis, that if we sample N points from X, then with some constant probability, each outer core Xi out will contain a sampled point. Corollary 4.4 summarizes the results. Lemma 4.3 Suppose that we have sampled i points {ˆx,..., ˆx i } from X. Let X,..., X m be all the clusters whose outer cores contain some sampled point ˆx j. Then Pr[ˆx i+ k j=m+ Xcor j ] 5. Corollary 4.4 (i) If we sample k points ĉ,..., ĉ k, then with probability ( O() ) k, = Ω(ɛ /3 ), each point ĉ i lies in a distinct core Xi cor, so ĉ i c i r i /. (ii) If we sample N = k 5 + ln(/) ( 5) points ĉ,..., ĉ N, = ɛ, then with probability O(), each outer core X out i contains some ĉ i, so ĉ i c i r i / Greedy deletion procedure. We maintain a set of centers Ĉ that are used to cluster X. We initialize Ĉ X and delete centers till k centers remain. For any point x R d, let R(x) X denote the points of X in the Voronoi region of x (given the set of centers Ĉ). We call R(x) the Voronoi set of x. Repeat the following steps until Ĉ = k. B. Compute T = x Ĉ y R(x) y x, the cost of clustering X around Ĉ. For every x Ĉ, compute T x = y R z Ĉ\{x} y x(z) z, where R x (z) is the Voronoi set of z given the center set Ĉ \ {x}. B. Pick the center y Ĉ for which T x T is minimum and set Ĉ Ĉ \ {y}. B3. Recompute the Voronoi sets R(x) = R y (x) X for each (remaining) center x Ĉ. Now we move the centers to the centroids of their respective (new) Voronoi sets, that is, for every set R(x), we update Ĉ Ĉ \ {x} {ctr(r(x))}.

7 Running time There are n k iterations of the B-B3 loop. Each iteration takes O(n d) time: computing T and the sets R(x) for each x takes O(n d) time and we can then compute each T x in O( R(x) d) time (since while computing T, we can also compute for each point its second-nearest center in Ĉ). Therefore the overall running time is O(n3 d). Analysis Let be a parameter such that 0, ɛ/ ( ɛ ) 4. Recall that D i = min j i c j c i. Define d i = k (X)/n i. We use a different notion of a cluster-core here, but one that still captures the fact that the core consists of points that are quite close to the clustercenter compared to the inter-cluster distance, and contains most of the mass of the cluster. Let B(x, r) = {y R d : x y r}. Define Z i = B(c i, d i / ) and the core of X i as Xi cor = X i Z i. Observe that r i d i, so by Markov s inequality Xi cor ( )n i. Also, since k (X) k (X) n idi we have that d i D i ɛ. Therefore, Xi cor = X Z i. We prove that, at the start of every iteration, for every i, there is a (distinct) center x Ĉ that lies in Z i. Call this invariant (*). Clearly (*) holds at the beginning, since Ĉ = X and Xcor i for every cluster X i. First, we argue (Lemma 4.5) that if x Ĉ is the only center that lies in a slightly enlarged version of the ball Z i for some i, then x is not deleted. The intuition is that there must be some other center c j whose Voronoi region (wrt. optimal center-set) contains at least two centers from Ĉ, and deleting one of these centers, the one further away from c j, should be less expensive. Lemma 4.6 then shows that even after a center y is deleted, if the new Voronoi region R y (x) of a center x Ĉ captures points from some Xj cor, then R y (x) cannot extend too far into some other cluster X l, that is, for any x R y (x) X l where l j, x c j is not much larger than x c l. This is due to the fact that for both clusters X j and X l, there are undeleted centers (due to the invariant and Lemma 4.5) z j Z j and z l Z l. Thus, since R y (x) Xj cor, it must be that x is relatively close to c j, Also, since x is captured by x and not z l, one can lower bound x z l, and hence x c l, in terms of x x (and d l ), which in turn can be lower bounded in terms of x c j (and dj ). Lemma 4.7 puts everything together to show that invariant (*) is maintained. Lemma 4.5 Suppose (*) holds at the start of an iteration, and x Ĉ is the only center in B(c i, 4d i / ) for some cluster X i, then x Ĉ after step B. Lemma 4.6 Suppose center y Ĉ is deleted in step B. Let x Ĉ \ {y} be such that R y(x) Xj cor for some j. Then for any x R y (x) X l, l j we have x c j x c l + max(d l+6d j,4d l +3d j). Lemma 4.7 Suppose that property (*) holds at the beginning of some iteration in the deletion phase. Then (*) also holds at the end of the iteration, i.e., after step B3. Proof: Suppose that we delete center y Ĉ that lies in the Voronoi region of center c i (wrt. optimal centers) in step B. Let Ĉ = Ĉ \ {y} and R (x) = R y (x) for any x Ĉ. Fix a cluster X j. Let S = {x Ĉ : R (x) Xj cor } and Y = x S R (x). We show that there is some set R (x), x Ĉ whose centroid ctr(r (x)) lies in the ball Z j, which will prove the lemma. By Lemma 4.6 and noting that d l ɛ D l for every l, for any x Y X l where l j, we have x c j x ɛ c l + max(d ( l + 6D j, 4D l + 3D j ). Also ) D j, D l c j c l x c j + x c l. Substituting for D j, D l we get that y c j β y c l where β = +7ɛ/ ( ). Using this we obtain that A = 7ɛ/ ( ) x Y x c j β k l= x Y X l x c l β k (X). We also have A = x S x R (x) y c j = x S( (R (x))+ R (x) ctr(r (x)) c j ) Y min x S ctr(r (x)) c j. Since Xj cor Y we have Y ( )n j, so min x S ctr(r (x)) c j β d i. The bounds on ensure that β, so that min x S ctr(r (x)) c j dj. Corollary 4.8 After the deletion phase, for every i, there is a center ĉ i Ĉ with ĉ ɛ i c i D ( i. ) 4..3 A linear time seeding procedure. We now combine the sampling idea with the deletion procedure to obtain a seeding procedure that runs in time O(nkd+k 3 d) and succeeds with high probability. We first sample O(k) points from X using the sampling procedure. Then we run the deletion procedure on an O(k)-size instance consisting of the centroids of the Voronoi regions of the sampled, points, with each centroid having a weight equal to the mass of X in its corresponding Voronoi region. The sampling process ensures that with high probability, every cluster X i contains a point ĉ i that is close to its center c i. This will allow us to argue that the k (.) cost of the sampled instance is much smaller than its k (.) cost, and that the optimal centers for the sampled instance lie near the optimal centers for X. By the analysis of the previous section, one can then show that after the deletion procedure the k centers are still close to the optimal centers for the sampled instance, and hence also close to the optimal centers for X. Fix = ɛ. C. Sampling. Sample N = k 5 points from X using the sampling procedure of Section 4... Let S denote the set of sampled points. + ln(/) ( 5 )

8 C. Deletion phase. For each x S, let R(x) = {y X : y x = min z S y z } be its Voronoi set (wrt. the set S). We now ignore X, and consider a weighted instance with point-set Ŝ = {ˆx = ctr(r(x)) : x S}, where each ˆx has a weight w(ˆx) = R(x). Run the deletion procedure of Section 4.., on Ŝ to obtain k centers ĉ,..., ĉ k. Running time Step C takes O(nNd) = O(nkd) time. The run-time analysis of the deletion phase in Section 4.., shows that step C takes O(N 3 d) = O(k 3 d) time. So the overall running time is O(nkd + k 3 d). Analysis Recall that = ɛ. Let = 3. Let Xi cor = {x X i : x c i r i }. Let Xi out = {x X i : x c i r i } denote the outer core of cluster X i. By part (ii) of Corollary 4.4 we know that with probability O( ), every cluster X i contains a sampled point in its outer core after step C. So assume that this event happens. Let ŝ,..., ŝ k denote the optimal k centers for Ŝ and ĉ,..., ĉ k be the centers returned by the deletion phase. Lemma 4.9 shows that the k-means separation condition also holds for Ŝ, and the optimal centers for Ŝ are close to the optimal centers for X. This implies that the centers returned by the deletion phase are close to the optimal centers for X. Lemma 4.9 k (Ŝ) = O(ɛ ) k (Ŝ). For each optimal center c i, there is some ŝ i such that ŝ i c i Di 5 + ri. Lemma 4.0 For each center c i, there is a center ĉ i such that ĉ i c i Di 0. Proof: Let ˆD i = min j i ŝ j ŝ i. Then ( θ) ˆD i D i ( ( + θ) where θ 5 + ɛ. Since = ( )) ɛ, we can ensure that θ <. Choosing for the deletion phase suitably, by Corollary 4.8, we can ensure that we obtain ĉ i such that ĉ i ŝ i ˆD i 0. Thus, ĉ i c i Di Procedures used in stage II Given k seed centers ĉ,..., ĉ k located sufficiently close to the optimal centers after stage I, we use two procedures in stage II to obtain a near-optimal clustering: the ball-kmeans step, which yields a ( + f(ɛ) ) -approximation algorithm, or the centroid estimation step, based on a sampling idea of Kumar et al. [30], which yields a PTAS with running time exponential in k. Define ˆd i = min j i ĉ j ĉ i. Ball-k-means step. Let B i be the points of X in a ball of radius ˆd i /3 around ĉ i, and c i be the centroid of B i. Return c,..., c k as the final centers. Centroid estimation. For each i, we will obtain a set of 5 candidate centers for cluster X i. Fix β = 5+56ɛ. Let R i X = {x X : x ĉ i min j x ĉ j + ˆd i /4} be the expanded Voronoi region of ĉ i. Sample 4 βω points independently and uniformly at random from R i, where ω is a given input parameter, to obtain a random subset S i R i. Compute the centroid of every subset of S i of size ω ; let T i be the set consisting of all these centroids. Select the candidates c T,..., c k T k that yield the least-cost solution, and return these as the final centers. Analysis Recall that D i = min j i c j c i. Let = 36ɛ. The proof of Lemma 4., which analyzes the ballk-means step, is essentially identical to that of Lemma 3.3. Lemma 4. Let Y i = { } x X i : x c i r i. If ĉ i c i D i /0 for each i, then Y i B i X i, and c i c i r i. Lemma 4. Suppose ĉ i c i D i /0 for each i. Then X i R i, where R i is as defined above, and X i β R i. Proof: We have 4Di 5 ˆd i 6Di 5. Consider any x X i that lies in the Voronoi region of ĉ j. We have x c i x c j, therefore x ĉ i x ĉ j +D i /5 x ĉ j + ˆd i /4; so x R i. Suppose X i β R i. Let a j = R i Xj ai a i β β R i. So. Consider the clustering where we arbitrarily assign some a j a i points of X i to center c j for each j i. For any x X i and j i, we have x c j ( x c i + c i c j ). So the cost of reassigning points in X i is at most (X i ) + ni a i j i a j c i c j (X i ) + β β j i a j R i c i c j. We also know that for any y R i X j, y c i y c j + Di+Dj 0, which implies that c i c j 8 5 y c j. Therefore, we y R i Xj y c j. can bound a j R i c i c j by 64 5 Hence, the cost of this clustering is at most max (, + ) 8β 5( β) k (X) ( ) + ɛ k (X). The cost of this clustering is also at least k (X). This is a contradiction to the assumption that k (X) ɛ k (X) A linear time constant-factor approximation algorithm D. Execute the seeding procedure of Section 4..3 to obtain k initial centers ĉ,..., ĉ k. D. Run the ball-k-means step of Section 4. to obtain the final centers. The running time is O(nkd + k 3 d). By Lemma 4.0, we know that with probability O( ɛ), for each c i, there is

9 a distinct center ĉ i such that ĉ i c i D i /0. Combined with Lemma 4., this yields the following theorem. Theorem 4.3 If k (X) k (X) ɛ for a small enough ɛ, the above algorithm returns a solution of cost at most 37ɛ k (X) with probability O( ɛ) in time O(nkd + k 3 d) A PTAS for any fixed k The PTAS combines the sampling procedure of Section 4.. with the centroid estimation step. E. Use the procedure in Section 4.. to pick k initial centers ĉ,..., ĉ k. E. Run the centroid estimation procedure of Section 4. to obtain the final centers. The running time is dominated by the exhaustive search in step E which takes time O ( (4k/βω) nd ). We show that the cost of the final solution is at most ( + ω) k (X), with probability γ k for some constant γ. By repeating the procedure O(γ k ) times, we can boost this to a constant. Theorem 4.4 Assuming that k (X) ɛ k (X) for a small enough ɛ, there is a PTAS for the k-means problem that returns a ( + ω)-optimal solution with constant probability in time O( O(k(+ɛ )/ω) nd). Proof: By appropriately setting in the sampling procedure, we can ensure that with probability Θ() k, it returns centers ĉ,..., ĉ k such that for each i, ĉ i c i D i /0 (part (i) of Corollary 4.4). So by Lemma 4. we know that X i β R i for every i. Now Lemma.3 in [30] shows that for every i, with constant probability, there is some candidate point c i T i such that x X i x c i ( + ω) (X i ). The cost of the best-candidate solution is at most the cost of the solution due to the points c T,..., c k T k. The overall success probability for one call of the procedure is γ k for some constant γ <, so by repeating the procedure O(γ k ) times we can obtain constant success probability. 5. The separation condition We show that our separation condition implies, and is implied by, the condition that any two near-optimal k- clusterings disagree on only a small fraction of the data. Let cost(x,..., x k ) denote the cost of clustering X around the centers x,..., x k R d. We use R(x) to denote the Voronoi region of point x (the centers will be clear from the context). Let S S = (S \ S ) (S \ S ) denote the symmetric difference of S and S. Theorem 5. Let α = 40ɛ 400. Suppose that X R d is ɛ-separated for k-means for a small enough ɛ. The following hold: (i) If there are centers ĉ,..., ĉ k such that cost(ĉ,..., ĉ k ) α k (X), then for each ĉ i there is a distinct optimal center c σ(i) such that R(ĉ i ) X σ(i) 8ɛ X σ(i) ; (ii) If ˆX is obtained by perturbing each x Xi by a distance of ɛ k (X) n then k ( ˆX) = O(ɛ ) k ( ˆX). Notice that part (i) also yields that if X is ɛ-separated for k-means, then any k-clustering of cost at most α ɛ k (X) has small Hamming distance to the optimal k-clustering (more strongly, each cluster has small Hamming distance to a distinct optimal cluster). We now show the converse. Theorem 5. Let ɛ 3. Suppose that for every k- clustering ˆX,..., ˆX k of X of cost at most α k (X), (i) there exists a bijection σ such that i, ˆX i X σ(i) ɛ X σ(i). Then, X is α-separated for k-means; (ii) there is a bijection σ such that k i= ˆX i X σ(i) X. Then, X is α-separated for k-means. ɛ k Proof: Let R,..., R k be an optimal (k )-means solution. We will construct a refinement of R,..., R k and argue that this has large Hamming distance to X,..., X k, and hence high cost, implying that k (X)/ k (X) is large. Let R k be the largest cluster. We start with an arbitrary refinement R,..., R k, ˆX k, ˆX k where ˆX k ˆX k = R k, ˆXk, ˆX k. If this has high cost, then we are done, otherwise let σ be the claimed bijection. For part (i), we introduce a large disagreement by splitting ˆX k X σ(k ) and ˆX k X σ(k) into two equal-sized halves, A k B k and A k B k respectively, and mismatching them. More precisely, we claim that the clustering R,..., R k, X k = ( ˆX k \ A k ) A k, X k = ( ˆX k \ A k ) A k has large Hamming distance. For any bijection σ, if σ (i) σ(i) for i k, then R i X σ (i) R i X σ(i) ( ɛ) R i ; otherwise, σ (k) {σ(k ), σ(k)}, so X k X σ (k) X k since X k \ X σ(k ) B k, X k \ X σ(k) A k. For part (ii), since R k k, we have ˆX k X σ(k ) + ˆX k X σ(k) k X. After the above mismatch operation, for any bijection σ, the total disagreement is at least X k X σ (k ) + X k X σ (k) ( ˆXk X σ(k ) + ˆX k X σ(k) ) (k ) X. References X [] K. Alsabti, S. Ranka, and V. Singh. An efficient k-means clustering algorithm. In Proc. st Workshop on High Performance Data Mining, 998.

10 [] D. Arthur and S. Vassilvitskii. How slow is the k-means method? In Proc. nd SoCG, pages 44 53, 006. [3] V. Arya, N. Garg, R. Khandekar, A. Meyerson, K. Munagala, and V. Pandit. Local search heuristics for k-median and facility location problems. SICOMP, 33:544 56, 004. [4] M. Badoiu, S. Har-Peled, and P. Indyk. Approximate clustering via core-sets. Proc. 34th STOC, pages 50 57, 00. [5] P. S. Bradley and U. Fayyad. Refining initial points for K- means clustering. In Proc. 5th ICML, pages 9 99, 998. [6] M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-median problems. In Proc. 40th FOCS, pages , 999. [7] M. Charikar, S. Guha, É. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. J. Comput. and Syst. Sci., 65:9 49, 00. [8] M. Chrobak, C. Kenyon, and N. Young. The reverse greedy algorithm for the metric k-median problem. Information Processing Letters, 97:68 7, 006. [9] D. R. Cox. Note on grouping. J. American Stat. Assoc., 5: , 957. [0] S. Dasgupta. How fast is k-means? In Proc. 6th COLT, page 735, 003. [] W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proc. 35th ACM STOC, pages 50 58, 003. [] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Statist. Soc. B, 39: 38, 977. [3] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay. Clustering large graphs via the Singular Value Decomposition. Machine Learning, 56:9 33, 004. [4] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 000. [5] M. Effros and L. J. Schulman. Deterministic clustering with data nets. Electronic Tech Report ECCC TR04-050, 004. [6] M. Effros and L. J. Schulman. Deterministic clustering with data nets. In Proc. ISIT, 004. [7] D. Fisher. Iterative optimization and simplification of hierarchical clusterings. J. Artif. Intell. Res., 4:47 78, 996. [8] E. Forgey. Cluster analysis of multivariate data: efficiency vs. interpretability of classification. Biometrics, :768, 965. [9] A. Gersho and R. M. Gray. Vector quantization and signal compression. Kluwer, 99. [0] R. M. Gray and D. L. Neuhoff. Quantization. IEEE Trans. Inform. Theory, 44(6):35 384, October 998. [] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In Proc. 36th STOC, pages 9 300, 004. [] S. Har-Peled and B. Sadri. How fast is the k-means method? Algorithmica, 4:85 0, 005. [3] R. E. Higgs, K. G. Bemis, I. A. Watson, and J. H. Wikel. Experimental designs for selecting molecules from large chemical databases. J. Chem. Inf. Comp. Sci., 37:86 870, 997. [4] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 3(3), September 999. [5] K. Jain, M. Mahdian, E. Markakis, A. Saberi, and V. Vazirani. Greedy facility location algorithms analyzed using dual-fitting with factor-revealing LP. JACM, 50:795 84, 003. [6] K. Jain and V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48:74 96, 00. [7] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. Comput. Geom., 8:89, 004. [8] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell., 4:88 89, 00. [9] L. Kaufman and P. J. Rousseeuw. Finding groups in data. An introduction to cluster analysis. Wiley, 990. [30] A. Kumar, Y. Sabharwal, and S. Sen. A simple linear time ( + ε)-approximation algorithm for k-means clustering in any dimensions. In Proc. 45th FOCS, pages , 004. [3] Y. Linde, A. Buzo, and R. M. Gray. An algorithm for vector quantization design. IEEE Trans. Commun., COM-8:84 95, January 980. [3] S. P. Lloyd. Least squares quantization in PCM. Special issue on quantization, IEEE Trans. Inform. Theory, 8:9 37, 98. [33] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. on Math. Statistics and Probability, pages 8 97, 967. [34] J. Matousek. On approximate geometric k-clustering. Discrete & Computational Geometry, 4:6 84, 000. [35] J. Max. Quantizing for minimum distortion. IEEE Trans. Inform. Theory, IT-6():7, March 960. [36] M. Meila and D. Heckerman. An experimental comparison of several clustering and initialization methods. In Proc. 4th UAI, pages , 998. [37] R. R. Mettu and C. G. Plaxton. Optimal time bounds for approximate clustering. Machine Learning, 56:35 60, 004. [38] G. Milligan. An examination of the effect of six types of error perturbation on fifteen clustering algorithms. Psychometrika, 45:35 34, 980. [39] R. Ostrovsky and Y. Rabani. Polynomial time approximation schemes for geometric clustering problems. JACM, 49():39 56, 00. [40] D. Pelleg and A. Moore. Accelerating exact k-means algorithms with geometric reasoning. In Proc. 5th ACM KDD, pages 77 8, 999. [4] J. M. Pena, J. A. Lozano, and P. Larranaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Lett., 0:07 040, 999. [4] S. J. Phillips. Acceleration of k-means and related clustering problems. In Proc. 4th ALENEX, 00. [43] L. J. Schulman. Clustering for edge-cost minimization. In Proc. 3nd ACM STOC, pages , 000. [44] M. Snarey, N. K. Terrett, P. Willet, and D. J. Wilton. Comparison of algorithms for dissimilarity-based compound selection. J. Mol. Graphics and Modelling, 5:37 385, 997. [45] D. Spielman and S. Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proc. 33rd ACM STOC, pages , 00. [46] H. Steinhaus. Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci., C. III vol IV:80 804, 956. [47] R. C. Tryon and D. E. Bailey. Cluster Analysis. McGraw- Hill, 970. Pages

### Applied Algorithm Design Lecture 5

Applied Algorithm Design Lecture 5 Pietro Michiardi Eurecom Pietro Michiardi (Eurecom) Applied Algorithm Design Lecture 5 1 / 86 Approximation Algorithms Pietro Michiardi (Eurecom) Applied Algorithm Design

Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650

### A Simple Linear Time (1 + ε)-approximation Algorithm for k-means Clustering in Any Dimensions

A Simple Linear Time (1 + ε)-approximation Algorithm for k-means Clustering in Any Dimensions Amit Kumar Dept. of Computer Science & Engg., IIT Delhi New Delhi-110016, India amitk@cse.iitd.ernet.in Yogish

Approximation Algorithms Chapter Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should I do? A. Theory says you're unlikely to find a poly-time algorithm. Must sacrifice one

### Constant Factor Approximation Algorithm for the Knapsack Median Problem

Constant Factor Approximation Algorithm for the Knapsack Median Problem Amit Kumar Abstract We give a constant factor approximation algorithm for the following generalization of the k-median problem. We

### A constant-factor approximation algorithm for the k-median problem

A constant-factor approximation algorithm for the k-median problem Moses Charikar Sudipto Guha Éva Tardos David B. Shmoys July 23, 2002 Abstract We present the first constant-factor approximation algorithm

### Fairness in Routing and Load Balancing

Fairness in Routing and Load Balancing Jon Kleinberg Yuval Rabani Éva Tardos Abstract We consider the issue of network routing subject to explicit fairness conditions. The optimization of fairness criteria

### Offline sorting buffers on Line

Offline sorting buffers on Line Rohit Khandekar 1 and Vinayaka Pandit 2 1 University of Waterloo, ON, Canada. email: rkhandekar@gmail.com 2 IBM India Research Lab, New Delhi. email: pvinayak@in.ibm.com

### Smaller Coresets for k-median and k-means Clustering

Smaller Coresets for k-median and k-means Clustering Sariel Har-Peled Akash Kushal UIUC, Urbana, IL Coresets for Clustering Clustering P: set of n points in IR d, compute a set C of k centers k-median:

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Approximation Algorithms 11 Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of three

### Lecture 4 Online and streaming algorithms for clustering

CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

### ! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Approximation Algorithms Chapter Approximation Algorithms Q Suppose I need to solve an NP-hard problem What should I do? A Theory says you're unlikely to find a poly-time algorithm Must sacrifice one of

### A Note on Maximum Independent Sets in Rectangle Intersection Graphs

A Note on Maximum Independent Sets in Rectangle Intersection Graphs Timothy M. Chan School of Computer Science University of Waterloo Waterloo, Ontario N2L 3G1, Canada tmchan@uwaterloo.ca September 12,

### DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

### Data Stability in Clustering: A Closer Look

Data Stability in Clustering: A Closer Look Lev Reyzin School of Computer Science Georgia Institute of Technology 266 Ferst Drive Atlanta, GA 30332 lreyzin@cc.gatech.edu Abstract. We consider the model

### Distance based clustering

// Distance based clustering Chapter ² ² Clustering Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 99). What is a cluster? Group of objects separated from other clusters Means

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

### EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

### Lecture 7: Approximation via Randomized Rounding

Lecture 7: Approximation via Randomized Rounding Often LPs return a fractional solution where the solution x, which is supposed to be in {0, } n, is in [0, ] n instead. There is a generic way of obtaining

### A Brief Introduction to Property Testing

A Brief Introduction to Property Testing Oded Goldreich Abstract. This short article provides a brief description of the main issues that underly the study of property testing. It is meant to serve as

### Clustering Lines in High Dimensional Space: Classification of Incomplete Data

Clustering Lines in High Dimensional Space: Classification of Incomplete Data JIE GAO Stony Brook University MICHAEL LANGBERG The Open University of Israel LEONARD J. SCHULMAN California Institute of Technology

### An example of a computable

An example of a computable absolutely normal number Verónica Becher Santiago Figueira Abstract The first example of an absolutely normal number was given by Sierpinski in 96, twenty years before the concept

### International Journal of Information Technology, Modeling and Computing (IJITMC) Vol.1, No.3,August 2013

FACTORING CRYPTOSYSTEM MODULI WHEN THE CO-FACTORS DIFFERENCE IS BOUNDED Omar Akchiche 1 and Omar Khadir 2 1,2 Laboratory of Mathematics, Cryptography and Mechanics, Fstm, University of Hassan II Mohammedia-Casablanca,

### Analysis of Approximation Algorithms for k-set Cover using Factor-Revealing Linear Programs

Analysis of Approximation Algorithms for k-set Cover using Factor-Revealing Linear Programs Stavros Athanassopoulos, Ioannis Caragiannis, and Christos Kaklamanis Research Academic Computer Technology Institute

### Approximation Algorithms

Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms

### 2.1 Complexity Classes

15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look

### Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based

### A Smoothed Analysis of the k-means Method

A Smoothed Analysis of the k-means Method DAVID ARTHUR, Stanford University, Department of Computer Science BODO MANTHEY, University of Twente, Department of Applied Mathematics HEIKO RÖGLIN, University

### How Slow is the k-means Method?

ow Slow is the k-means Method? David Arthur Stanford University Stanford, CA darthur@cs.stanford.edu Sergei Vassilvitskii Stanford University Stanford, CA sergei@cs.stanford.edu ABSTRACT The k-means method

### Standardization and Its Effects on K-Means Clustering Algorithm

Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03

### Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

### Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

### CSC2420 Fall 2012: Algorithm Design, Analysis and Theory

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Allan Borodin November 15, 2012; Lecture 10 1 / 27 Randomized online bipartite matching and the adwords problem. We briefly return to online algorithms

### 1. R In this and the next section we are going to study the properties of sequences of real numbers.

+a 1. R In this and the next section we are going to study the properties of sequences of real numbers. Definition 1.1. (Sequence) A sequence is a function with domain N. Example 1.2. A sequence of real

### Factoring & Primality

Factoring & Primality Lecturer: Dimitris Papadopoulos In this lecture we will discuss the problem of integer factorization and primality testing, two problems that have been the focus of a great amount

### Private Approximation of Clustering and Vertex Cover

Private Approximation of Clustering and Vertex Cover Amos Beimel, Renen Hallak, and Kobbi Nissim Department of Computer Science, Ben-Gurion University of the Negev Abstract. Private approximation of search

### arxiv:1112.0829v1 [math.pr] 5 Dec 2011

How Not to Win a Million Dollars: A Counterexample to a Conjecture of L. Breiman Thomas P. Hayes arxiv:1112.0829v1 [math.pr] 5 Dec 2011 Abstract Consider a gambling game in which we are allowed to repeatedly

### An Impossibility Theorem for Clustering

An Impossibility Theorem for Clustering Jon Kleinberg Department of Computer Science Cornell University Ithaca NY 14853 Abstract Although the study of clustering is centered around an intuitively compelling

### Notes from Week 1: Algorithms for sequential prediction

CS 683 Learning, Games, and Electronic Markets Spring 2007 Notes from Week 1: Algorithms for sequential prediction Instructor: Robert Kleinberg 22-26 Jan 2007 1 Introduction In this course we will be looking

### Cluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico

Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from

### CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma

CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are

### Online and Offline Selling in Limit Order Markets

Online and Offline Selling in Limit Order Markets Kevin L. Chang 1 and Aaron Johnson 2 1 Yahoo Inc. klchang@yahoo-inc.com 2 Yale University ajohnson@cs.yale.edu Abstract. Completely automated electronic

### A simpler and better derandomization of an approximation algorithm for Single Source Rent-or-Buy

A simpler and better derandomization of an approximation algorithm for Single Source Rent-or-Buy David P. Williamson Anke van Zuylen School of Operations Research and Industrial Engineering, Cornell University,

### Linear programming approach for online advertising

Linear programming approach for online advertising Igor Trajkovski Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje, Rugjer Boshkovikj 16, P.O. Box 393, 1000 Skopje,

### About the inverse football pool problem for 9 games 1

Seventh International Workshop on Optimal Codes and Related Topics September 6-1, 013, Albena, Bulgaria pp. 15-133 About the inverse football pool problem for 9 games 1 Emil Kolev Tsonka Baicheva Institute

### The Online Set Cover Problem

The Online Set Cover Problem Noga Alon Baruch Awerbuch Yossi Azar Niv Buchbinder Joseph Seffi Naor ABSTRACT Let X = {, 2,..., n} be a ground set of n elements, and let S be a family of subsets of X, S

### Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization

Adaptive Search with Stochastic Acceptance Probabilities for Global Optimization Archis Ghate a and Robert L. Smith b a Industrial Engineering, University of Washington, Box 352650, Seattle, Washington,

### Online k-means Clustering of Nonstationary Data

15.097 Prediction Proect Report Online -Means Clustering of Nonstationary Data Angie King May 17, 2012 1 Introduction and Motivation Machine learning algorithms often assume we have all our data before

### Lecture 6 Online and streaming algorithms for clustering

CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line

### Using the Triangle Inequality to Accelerate

Using the Triangle Inequality to Accelerate -Means Charles Elkan Department of Computer Science and Engineering University of California, San Diego La Jolla, California 92093-0114 ELKAN@CS.UCSD.EDU Abstract

### 1.1 Related work For non-preemptive scheduling on uniformly related machines the rst algorithm with a constant competitive ratio was given by Aspnes e

A Lower Bound for On-Line Scheduling on Uniformly Related Machines Leah Epstein Jir Sgall September 23, 1999 lea@math.tau.ac.il, Department of Computer Science, Tel-Aviv University, Israel; sgall@math.cas.cz,

Mechanisms for Fair Attribution Eric Balkanski Yaron Singer Abstract We propose a new framework for optimization under fairness constraints. The problems we consider model procurement where the goal is

### B490 Mining the Big Data. 2 Clustering

B490 Mining the Big Data 2 Clustering Qin Zhang 1-1 Motivations Group together similar documents/webpages/images/people/proteins/products One of the most important problems in machine learning, pattern

### CSE 494 CSE/CBS 598 (Fall 2007): Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye

CSE 494 CSE/CBS 598 Fall 2007: Numerical Linear Algebra for Data Exploration Clustering Instructor: Jieping Ye 1 Introduction One important method for data compression and classification is to organize

### Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

### Cluster Analysis for Optimal Indexing

Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Cluster Analysis for Optimal Indexing Tim Wylie, Michael A. Schuh, John Sheppard, and Rafal A.

### Topic: Greedy Approximations: Set Cover and Min Makespan Date: 1/30/06

CS880: Approximations Algorithms Scribe: Matt Elder Lecturer: Shuchi Chawla Topic: Greedy Approximations: Set Cover and Min Makespan Date: 1/30/06 3.1 Set Cover The Set Cover problem is: Given a set of

### A Near-linear Time Constant Factor Algorithm for Unsplittable Flow Problem on Line with Bag Constraints

A Near-linear Time Constant Factor Algorithm for Unsplittable Flow Problem on Line with Bag Constraints Venkatesan T. Chakaravarthy, Anamitra R. Choudhury, and Yogish Sabharwal IBM Research - India, New

### K-median Algorithms: Theory in Practice

K-median Algorithms: Theory in Practice David Dohan, Stefani Karp, Brian Matejek Submitted: January 16, 2015 1 Introduction 1.1 Background and Notation The k-median problem has received significant attention

### Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

### 8.1 Makespan Scheduling

600.469 / 600.669 Approximation Algorithms Lecturer: Michael Dinitz Topic: Dynamic Programing: Min-Makespan and Bin Packing Date: 2/19/15 Scribe: Gabriel Kaptchuk 8.1 Makespan Scheduling Consider an instance

### Breaking Generalized Diffie-Hellman Modulo a Composite is no Easier than Factoring

Breaking Generalized Diffie-Hellman Modulo a Composite is no Easier than Factoring Eli Biham Dan Boneh Omer Reingold Abstract The Diffie-Hellman key-exchange protocol may naturally be extended to k > 2

### Unsupervised learning: Clustering

Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

### Lecture 3: Linear Programming Relaxations and Rounding

Lecture 3: Linear Programming Relaxations and Rounding 1 Approximation Algorithms and Linear Relaxations For the time being, suppose we have a minimization problem. Many times, the problem at hand can

### Robust Outlier Detection Technique in Data Mining: A Univariate Approach

Robust Outlier Detection Technique in Data Mining: A Univariate Approach Singh Vijendra and Pathak Shivani Faculty of Engineering and Technology Mody Institute of Technology and Science Lakshmangarh, Sikar,

### LOCAL SEARCH HEURISTICS FOR k-median AND FACILITY LOCATION PROBLEMS

SIAM J. COMPUT. Vol. 33, No. 3, pp. 544 562 c 2004 Society for Industrial and Applied Mathematics LOCAL SEARCH HEURISTICS FOR k-median AND FACILITY LOCATION PROBLEMS VIJAY ARYA, NAVEEN GARG, ROHIT KHANDEKAR,

### Compact Representations and Approximations for Compuation in Games

Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions

### Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

### Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

### Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

### The Planar k-means Problem is NP-hard

The Planar k-means Problem is NP-hard Meena Mahajan a, Prajakta Nimbhorkar a, Kasturi Varadarajan b a The Institute of Mathematical Sciences, Chennai 600 113, India. b The University of Iowa, Iowa City,

### Nonparametric adaptive age replacement with a one-cycle criterion

Nonparametric adaptive age replacement with a one-cycle criterion P. Coolen-Schrijner, F.P.A. Coolen Department of Mathematical Sciences University of Durham, Durham, DH1 3LE, UK e-mail: Pauline.Schrijner@durham.ac.uk

### Ecient approximation algorithm for minimizing makespan. on uniformly related machines. Chandra Chekuri. November 25, 1997.

Ecient approximation algorithm for minimizing makespan on uniformly related machines Chandra Chekuri November 25, 1997 Abstract We obtain a new ecient approximation algorithm for scheduling precedence

### THREE DIMENSIONAL GEOMETRY

Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,

### THE SCHEDULING OF MAINTENANCE SERVICE

THE SCHEDULING OF MAINTENANCE SERVICE Shoshana Anily Celia A. Glass Refael Hassin Abstract We study a discrete problem of scheduling activities of several types under the constraint that at most a single

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### 1 Approximating Set Cover

CS 05: Algorithms (Grad) Feb 2-24, 2005 Approximating Set Cover. Definition An Instance (X, F ) of the set-covering problem consists of a finite set X and a family F of subset of X, such that every elemennt

Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

### max cx s.t. Ax c where the matrix A, cost vector c and right hand side b are given and x is a vector of variables. For this example we have x

Linear Programming Linear programming refers to problems stated as maximization or minimization of a linear function subject to constraints that are linear equalities and inequalities. Although the study

### Nan Kong, Andrew J. Schaefer. Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA

A Factor 1 2 Approximation Algorithm for Two-Stage Stochastic Matching Problems Nan Kong, Andrew J. Schaefer Department of Industrial Engineering, Univeristy of Pittsburgh, PA 15261, USA Abstract We introduce

### 10-810 /02-710 Computational Genomics. Clustering expression data

10-810 /02-710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally,

### Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range

THEORY OF COMPUTING, Volume 1 (2005), pp. 37 46 http://theoryofcomputing.org Polynomial Degree and Lower Bounds in Quantum Complexity: Collision and Element Distinctness with Small Range Andris Ambainis

### Fast and Accurate k-llleans For Large Datasets

Fast and Accurate k-llleans For Large Datasets Michael Shindler School of EECS Oregon State University shindler@eecs.oregonstate.edu Alex Wong Department of Computer Science UC Los Angeles alexw@seas.ucla.edu

### The Conference Call Search Problem in Wireless Networks

The Conference Call Search Problem in Wireless Networks Leah Epstein 1, and Asaf Levin 2 1 Department of Mathematics, University of Haifa, 31905 Haifa, Israel. lea@math.haifa.ac.il 2 Department of Statistics,

### An Empirical Study of Two MIS Algorithms

An Empirical Study of Two MIS Algorithms Email: Tushar Bisht and Kishore Kothapalli International Institute of Information Technology, Hyderabad Hyderabad, Andhra Pradesh, India 32. tushar.bisht@research.iiit.ac.in,

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### The Container Selection Problem

The Container Selection Problem Viswanath Nagarajan 1, Kanthi K. Sarpatwar 2, Baruch Schieber 2, Hadas Shachnai 3, and Joel L. Wolf 2 1 University of Michigan, Ann Arbor, MI, USA viswa@umich.edu 2 IBM

### Weighted Clustering. Margareta Ackerman, Shai Ben-David, Simina Branzei, and David Loker

Weighted Clustering Margareta Ackerman, Shai Ben-David, Simina Branzei, and David Loker University of Waterloo D.R.C. School of Computer Science {mackerma, shai, sbranzei, dloker}@uwaterloo.ca Abstract

### Online Scheduling with Bounded Migration

Online Scheduling with Bounded Migration Peter Sanders, Naveen Sivadasan, and Martin Skutella Max-Planck-Institut für Informatik, Saarbrücken, Germany, {sanders,ns,skutella}@mpi-sb.mpg.de Abstract. Consider

### Clustering. Chapter 7. 7.1 Introduction to Clustering Techniques. 7.1.1 Points, Spaces, and Distances

240 Chapter 7 Clustering Clustering is the process of examining a collection of points, and grouping the points into clusters according to some distance measure. The goal is that points in the same cluster

### CONTRIBUTIONS TO ZERO SUM PROBLEMS

CONTRIBUTIONS TO ZERO SUM PROBLEMS S. D. ADHIKARI, Y. G. CHEN, J. B. FRIEDLANDER, S. V. KONYAGIN AND F. PAPPALARDI Abstract. A prototype of zero sum theorems, the well known theorem of Erdős, Ginzburg

### Faster deterministic integer factorisation

David Harvey (joint work with Edgar Costa, NYU) University of New South Wales 25th October 2011 The obvious mathematical breakthrough would be the development of an easy way to factor large prime numbers

### Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy

Astronomical Data Analysis Software and Systems XIV ASP Conference Series, Vol. XXX, 2005 P. L. Shopbell, M. C. Britton, and R. Ebert, eds. P2.1.25 Making the Most of Missing Values: Object Clustering

### IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE

### 15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

### Compactness in metric spaces

MATHEMATICS 3103 (Functional Analysis) YEAR 2012 2013, TERM 2 HANDOUT #2: COMPACTNESS OF METRIC SPACES Compactness in metric spaces The closed intervals [a, b] of the real line, and more generally the

### A Perspective on Statistical Tools for Data Mining Applications

A Perspective on Statistical Tools for Data Mining Applications David M. Rocke Center for Image Processing and Integrated Computing University of California, Davis Statistics and Data Mining Statistics