Nonconvex? NP! (No Problem!)
Ryan Tibshirani, Convex Optimization 10-725/36-725

Beyond the tip?

Some takeaway points:
- If possible, formulate the task in terms of convex optimization: it is typically easier to solve and easier to analyze
- Nonconvex does not necessarily mean nonscientific! However, statistically, it does typically mean high(er) variance
- In more cases than you might expect, nonconvex problems can be solved exactly (to global optimality)

What does it mean for a problem to be nonconvex? Consider a generic optimization problem:

    min_x  f(x)
    subject to  h_i(x) ≤ 0,  i = 1, ..., m
                l_j(x) = 0,  j = 1, ..., r

This is a convex problem if f and h_i, i = 1, ..., m are convex, and l_j, j = 1, ..., r are affine. A nonconvex problem is one of this form in which not all of these conditions are met. But trivial modifications of convex problems can lead to nonconvex formulations... so we really just consider nonconvex problems that are not trivially equivalent to convex ones.

What does it mean to solve a nonconvex problem? Nonconvex problems can have local minima, i.e., there can exist a feasible x such that f(y) ≥ f(x) for all feasible y with ||x − y||_2 ≤ R, but x is still not globally optimal. (Note: we proved that this cannot happen for convex problems.) Hence by solving a nonconvex problem, we mean finding the global minimizer. We also implicitly mean doing it efficiently, i.e., in polynomial time.

Addendum: this is really about putting together a list of cool problems that are surprisingly tractable... hence there will be exceptions regarding nonconvexity and/or requiring exact global optima. (Also, I'm sure that there are many more examples out there that I'm missing, so I invite you to give me ideas / contribute!)

Outline. Rough categories for today's problems:
- Classic nonconvex problems
- Eigen problems
- Graph problems
- Nonconvex proximal operators
- Discrete problems
- Infinite-dimensional problems
- Statistical problems

Classic nonconvex problems

Linear-fractional programs. A linear-fractional program is of the form

    min_{x ∈ R^n}  (c^T x + d) / (e^T x + f)
    subject to     Gx ≤ h,  e^T x + f > 0
                   Ax = b

This is nonconvex (but quasiconvex). Provided that this problem is feasible, it is in fact equivalent to the linear program

    min_{y ∈ R^n, z ∈ R}  c^T y + dz
    subject to            Gy − hz ≤ 0,  z ≥ 0
                          Ay − bz = 0,  e^T y + fz = 1

The link between the two problems is the transformation

    y = x / (e^T x + f),    z = 1 / (e^T x + f)

The proof of their equivalence is simple; e.g., see B & V Chapter 4.

Linear-fractional problems show up in the study of solution paths for many common statistical estimation problems. [Figure: lasso coefficient paths plotted against the regularization parameter λ.] The knots in the lasso path (values of λ at which a coefficient is made nonzero) can be seen as the optimal values of linear-fractional programs. E.g., see Taylor et al. (2013), "Inference in adaptive regression via the Kac-Rice formula".
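As a concrete illustration, here is a minimal sketch (assuming SciPy, on a small made-up instance; not from the slides) of solving a linear-fractional program by passing the equivalent LP in (y, z) to scipy.optimize.linprog and then recovering x = y / z:

```python
import numpy as np
from scipy.optimize import linprog

# Made-up instance: minimize (c^T x + d) / (e^T x + f)
# subject to G x <= h, A x = b, e^T x + f > 0.
c = np.array([1.0, -1.0]); d = 0.5
e = np.array([0.5, 1.0]);  f = 2.0
G = np.vstack([np.eye(2), -np.eye(2)]); h = np.array([3.0, 3.0, 0.0, 0.0])  # 0 <= x <= 3
A = np.array([[1.0, 1.0]]); b = np.array([2.0])

# Equivalent LP in variables (y, z):
#   minimize   c^T y + d z
#   subject to G y - h z <= 0, A y - b z = 0, e^T y + f z = 1, z >= 0
obj = np.concatenate([c, [d]])
A_ub = np.hstack([G, -h[:, None]])
b_ub = np.zeros(G.shape[0])
A_eq = np.vstack([np.hstack([A, -b[:, None]]),
                  np.concatenate([e, [f]])[None, :]])
b_eq = np.concatenate([np.zeros(A.shape[0]), [1.0]])
bounds = [(None, None)] * len(c) + [(0, None)]  # y free, z >= 0

res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
y, z = res.x[:-1], res.x[-1]
x_opt = y / z   # recover the solution of the original linear-fractional program
print(x_opt, res.fun)
```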

Geometric programs. A monomial is a function f : R^n_{++} → R of the form

    f(x) = γ x_1^{a_1} x_2^{a_2} ··· x_n^{a_n}

for γ > 0 and a_1, ..., a_n ∈ R. A posynomial is a sum of monomials,

    f(x) = sum_{k=1}^p γ_k x_1^{a_{k1}} x_2^{a_{k2}} ··· x_n^{a_{kn}}

A geometric program is of the form

    min  f(x)
    subject to  g_i(x) ≤ 1,  i = 1, ..., m
                h_j(x) = 1,  j = 1, ..., r

where f and g_i, i = 1, ..., m are posynomials and h_j, j = 1, ..., r are monomials. This is nonconvex.

This is equivalent to a convex problem, via a simple transformation. Given a monomial f(x) = γ x_1^{a_1} x_2^{a_2} ··· x_n^{a_n}, let y_i = log x_i and rewrite this as

    γ (e^{y_1})^{a_1} (e^{y_2})^{a_2} ··· (e^{y_n})^{a_n} = e^{a^T y + b}

for b = log γ. Likewise, a posynomial can be written as sum_{k=1}^p e^{a_k^T y + b_k}. With this variable substitution, and after taking logs, a geometric program is equivalent to

    min  log( sum_{k=1}^{p_0} e^{a_{0k}^T y + b_{0k}} )
    subject to  log( sum_{k=1}^{p_i} e^{a_{ik}^T y + b_{ik}} ) ≤ 0,  i = 1, ..., m
                c_j^T y + d_j = 0,  j = 1, ..., r

This is convex, recalling the convexity of soft max (log-sum-exp) functions.
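A minimal numerical check of this change of variables (a made-up posynomial, purely for illustration; not part of the original slides): evaluating the posynomial directly and via the exponential / log-sum-exp form gives the same value.

```python
import numpy as np

# Made-up posynomial in n = 2 variables with p = 3 monomial terms:
# f(x) = sum_k gamma_k * x_1^{a_k1} * x_2^{a_k2}
gamma = np.array([2.0, 0.5, 1.5])
A = np.array([[1.0, -2.0],
              [0.5,  1.0],
              [-1.0, 3.0]])   # row k holds the exponents of the k-th monomial

x = np.array([1.3, 0.7])      # any point in R^n_{++}
f_direct = np.sum(gamma * np.prod(x ** A, axis=1))

# Change of variables y = log x, b_k = log gamma_k:
y, b = np.log(x), np.log(gamma)
f_transformed = np.sum(np.exp(A @ y + b))    # sum_k e^{a_k^T y + b_k}
log_f = np.log(np.sum(np.exp(A @ y + b)))    # the convex log-sum-exp objective

print(f_direct, f_transformed, log_f)        # first two agree; log_f is their log
```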

Many interesting problems are geometric programs; see Boyd et al. (2007), "A tutorial on geometric programming", and also Chapter 8.8 of the B & V book.

[Figure 8.18 from B & V: the floor planning problem. Non-overlapping rectangular cells are placed in a rectangle with width W, height H, and lower left corner at (0, 0). The i-th cell is specified by its width w_i, height h_i, and the coordinates of its lower left corner, (x_i, y_i).]

Extension to the matrix world: Sra and Hosseini (2013), "Geometric optimization on positive definite matrices with application to elliptically contoured distributions".

Handling convex equality constraints. Given convex f and h_i, i = 1, ..., m, the problem

    min_x  f(x)
    subject to  h_i(x) ≤ 0,  i = 1, ..., m
                l(x) = 0

is nonconvex when l is convex but not affine. A convex relaxation of this problem is

    min_x  f(x)
    subject to  h_i(x) ≤ 0,  i = 1, ..., m
                l(x) ≤ 0

If we can ensure that l(x*) = 0 at any solution x* of the relaxed problem, then the two are equivalent.

From B & V Exercises 4.6 and 4.58, e.g., consider the maximum utility problem

    max_{x_0, ..., x_T ∈ R, b_1, ..., b_{T+1} ∈ R}  sum_{t=0}^T α_t u(x_t)
    subject to  b_{t+1} = b_t + f(b_t) − x_t,  t = 0, ..., T
                0 ≤ x_t ≤ b_t,  t = 0, ..., T

where b_0 ≥ 0 is fixed. Interpretation: x_t is the amount spent of your total available money b_t at time t; the concave function u gives utility, and the concave function f measures investment return.

This is not a convex problem, because of the equality constraint; but we can relax it to

    b_{t+1} ≤ b_t + f(b_t) − x_t,  t = 0, ..., T

without changing the solution (think about throwing out money).

Problems with two quadratic functions. Consider the problem involving two quadratics

    min_{x ∈ R^n}  x^T A_0 x + 2 b_0^T x + c_0
    subject to     x^T A_1 x + 2 b_1^T x + c_1 ≤ 0

Here A_0, A_1 need not be positive definite, so this is nonconvex. The dual problem can be cast as

    max_{u ∈ R, v ∈ R}  u
    subject to  [ A_0 + v A_1        b_0 + v b_1     ]
                [ (b_0 + v b_1)^T    c_0 + v c_1 − u ]  ⪰ 0
                v ≥ 0

and (as always) is convex. Furthermore, strong duality holds. See Appendix B of B & V; see also Beck and Eldar (2006), "Strong duality in nonconvex quadratic optimization with two quadratic constraints".

Eigen problems

Principal component analysis. Given a matrix X ∈ R^{n×p}, consider the nonconvex problem

    min_{R ∈ R^{n×p}}  ||X − R||_F^2
    subject to         rank(R) = k

for some fixed k. The solution here is given by the singular value decomposition of X: if X = U D V^T, then

    R̂ = U_k D_k V_k^T,

where U_k, V_k are the first k columns of U, V, and D_k contains the first k diagonal elements of D. I.e., R̂ is the reconstruction of X from its first k principal components.

This is often called the Eckart-Young theorem, established in 1936, but it was probably known even earlier; see Stewart (1992), "On the early history of the singular value decomposition".
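A minimal numpy sketch of this fact (made-up data, not from the slides): the best rank-k approximation in Frobenius norm comes straight from the truncated SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # made-up data matrix
k = 3

U, d, Vt = np.linalg.svd(X, full_matrices=False)
R_hat = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]   # rank-k reconstruction U_k D_k V_k^T

print(np.linalg.matrix_rank(R_hat))             # k
print(np.linalg.norm(X - R_hat, "fro") ** 2)    # optimal value of the rank-constrained problem
```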

Fantope. Another characterization of the SVD is via the following nonconvex problem, given X ∈ R^{n×p}:

    min_{Z ∈ S^p}  ||X − XZ||_F^2
    subject to     rank(Z) = k,  Z is a projection

or equivalently

    max_{Z ∈ S^p}  ⟨X^T X, Z⟩
    subject to     rank(Z) = k,  Z is a projection

The solution here is Ẑ = V_k V_k^T, where the columns of V_k ∈ R^{p×k} give the first k eigenvectors of X^T X.

This is equivalent to a convex problem. Express the constraint set C as

    C = { Z ∈ S^p : rank(Z) = k, Z is a projection }
      = { Z ∈ S^p : λ_i(Z) ∈ {0, 1} for i = 1, ..., p, tr(Z) = k }

Now consider the convex hull F_k = conv(C):

    F_k = { Z ∈ S^p : λ_i(Z) ∈ [0, 1], i = 1, ..., p, tr(Z) = k }
        = { Z ∈ S^p : 0 ⪯ Z ⪯ I, tr(Z) = k }

This is called the Fantope of order k. Further, the convex problem

    max_{Z ∈ S^p}  ⟨X^T X, Z⟩
    subject to     Z ∈ F_k

admits the same solution as the original one, i.e., Ẑ = V_k V_k^T.

See Fan (1949), "On a theorem of Weyl concerning eigenvalues of linear transformations", and Overton and Womersley (1992), "On the sum of the largest eigenvalues of a symmetric matrix". Sparse PCA extension: Vu et al. (2013), "Fantope projection and selection: near-optimal convex relaxation of sparse PCA".
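A quick numerical check of this characterization (made-up data, assuming numpy): the vertex Ẑ = V_k V_k^T of the Fantope attains objective value equal to the sum of the top-k eigenvalues of X^T X.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 8))     # made-up data
k = 2

# Eigendecomposition of X^T X (eigh returns eigenvalues in ascending order)
evals, evecs = np.linalg.eigh(X.T @ X)
V_k = evecs[:, -k:]                  # top-k eigenvectors
Z_hat = V_k @ V_k.T                  # rank-k projection, a vertex of the Fantope F_k

# <X^T X, Z_hat> equals the sum of the top-k eigenvalues
print(np.trace(X.T @ X @ Z_hat), np.sum(evals[-k:]))
```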

Classical multidimensional scaling. Let x_1, ..., x_n ∈ R^p, and define similarities S_ij = (x_i − x̄)^T (x_j − x̄). For fixed k, classical multidimensional scaling (MDS) solves the nonconvex problem

    min_{z_1, ..., z_n ∈ R^k}  sum_{i,j} ( S_ij − (z_i − z̄)^T (z_j − z̄) )^2

[Figure from Hastie et al. (2009), "The Elements of Statistical Learning".]

Let S be the similarity matrix (with entries S_ij = (x_i − x̄)^T (x_j − x̄)). The classical MDS problem has an exact solution in terms of the eigendecomposition S = U D^2 U^T: ẑ_1, ..., ẑ_n are the rows of U_k D_k, where U_k is the first k columns of U, and D_k the first k diagonal entries of D.

Note: other very similar forms of MDS are not convex, and not directly solvable, e.g., least squares scaling, with d_ij = ||x_i − x_j||_2:

    min_{z_1, ..., z_n ∈ R^k}  sum_{i,j} ( d_ij − ||z_i − z_j||_2 )^2

See Hastie et al. (2009), Chapter 14.
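A minimal numpy sketch of the classical MDS solution (made-up data; not from the slides): center the points, form S, eigendecompose, and read off the embeddings from the top-k eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))    # made-up data, n = 30 points in R^5
k = 2

Xc = X - X.mean(axis=0)             # center the points
S = Xc @ Xc.T                       # similarity matrix S_ij = (x_i - xbar)^T (x_j - xbar)

evals, U = np.linalg.eigh(S)        # ascending eigenvalues; S = U D^2 U^T
idx = np.argsort(evals)[::-1][:k]   # indices of the top-k eigenvalues
Z = U[:, idx] * np.sqrt(np.maximum(evals[idx], 0))   # rows are the embeddings z_i

print(Z.shape)                      # (30, 2)
```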

Generalized eigenvalue problems. Given B, W ∈ S^p with B, W ⪰ 0, consider the nonconvex problem

    max_{v ∈ R^p}  (v^T B v) / (v^T W v)

This is a generalized eigenvalue problem, with exact solution given by the top eigenvector of W^{-1} B.

This is important, e.g., in Fisher's discriminant analysis, where B is the between-class covariance matrix and W the within-class covariance matrix. See Hastie et al. (2009), Chapter 4.
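A minimal sketch (assuming SciPy, made-up positive definite matrices): scipy.linalg.eigh solves the generalized eigenproblem B v = λ W v directly, and the top eigenvector maximizes the Rayleigh quotient.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(3)
p = 4
# Made-up positive definite B (between-class) and W (within-class) matrices
Mb, Mw = rng.standard_normal((p, p)), rng.standard_normal((p, p))
B = Mb @ Mb.T + np.eye(p)
W = Mw @ Mw.T + np.eye(p)

# Generalized eigenproblem B v = lambda W v; eigh returns ascending eigenvalues
evals, evecs = eigh(B, W)
v_star = evecs[:, -1]               # maximizer of v^T B v / v^T W v

ratio = (v_star @ B @ v_star) / (v_star @ W @ v_star)
print(ratio, evals[-1])             # the two agree
```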

Graph problems

Min cut. Given a graph G = (V, E) with V = {1, ..., n}, two nodes s, t ∈ V, and costs c_ij ≥ 0 on edges (i, j) ∈ E, the min cut problem is:

    min_{b ∈ R^E, x ∈ R^V}  sum_{(i,j) ∈ E} b_ij c_ij
    subject to  b_ij ≥ x_j − x_i
                b_ij, x_i, x_j ∈ {0, 1} for all i, j
                x_s = 0, x_t = 1

Think of b_ij as the indicator that the edge (i, j) traverses the cut from s to t; think of x_i as an indicator that node i is grouped with t. This nonconvex problem can be solved exactly using max flow (by the max flow/min cut theorem).

A relaxation of min cut:

    min_{b ∈ R^E, x ∈ R^V}  sum_{(i,j) ∈ E} b_ij c_ij
    subject to  b_ij ≥ x_j − x_i for all i, j
                b ≥ 0
                x_s = 0, x_t = 1

This is an LP; recall that it is the dual of the max flow LP:

    max_{f ∈ R^E}  sum_{(s,j) ∈ E} f_sj
    subject to  f_ij ≥ 0,  f_ij ≤ c_ij for all (i, j) ∈ E
                sum_{(i,k) ∈ E} f_ik = sum_{(k,j) ∈ E} f_kj for all k ∈ V \ {s, t}

The max flow min cut theorem tells us that the relaxed min cut is tight.
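As an illustration (assuming the networkx package, on a tiny made-up graph; not from the slides), max flow and min cut can be computed directly and their values coincide:

```python
import networkx as nx

# Tiny made-up directed graph with capacities c_ij
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3.0)
G.add_edge("s", "b", capacity=2.0)
G.add_edge("a", "b", capacity=1.0)
G.add_edge("a", "t", capacity=2.0)
G.add_edge("b", "t", capacity=3.0)

flow_value, _ = nx.maximum_flow(G, "s", "t")
cut_value, (s_side, t_side) = nx.minimum_cut(G, "s", "t")

print(flow_value, cut_value)   # equal, by the max flow / min cut theorem
print(s_side, t_side)          # the two groups of nodes defining the cut
```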

Shortest paths. Given a graph G = (V, E) with edge costs c_e, e ∈ E, consider the shortest path problem between two nodes s, t ∈ V:

    min_{paths P = (e_1, ..., e_r)}  sum_{e ∈ P} c_e
    subject to  e_{1,1} = s,  e_{r,2} = t
                e_{i,2} = e_{i+1,1},  i = 1, ..., r − 1

Dijkstra's algorithm solves this problem (and more); see Dijkstra (1959), "A note on two problems in connexion with graphs". Clever implementations run in O(|E| log |V|) time; e.g., see Kleinberg and Tardos (2005), "Algorithm Design", Chapter 5.
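A minimal heap-based Dijkstra sketch (made-up adjacency list, nonnegative edge costs assumed), just to make the algorithm concrete:

```python
import heapq

def dijkstra(adj, s):
    """Shortest-path distances from s; adj maps node -> list of (neighbor, cost)."""
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale heap entry
        for v, c in adj.get(u, []):
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Tiny made-up graph
adj = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("t", 6.0)],
    "b": [("t", 1.0)],
}
print(dijkstra(adj, "s")["t"])   # 4.0, via s -> a -> b -> t
```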

Nonconvex proximal operators

Hard-thresholding. One of the simplest nonconvex problems, given y ∈ R^n:

    min_{β ∈ R^n}  sum_{i=1}^n (y_i − β_i)^2 + sum_{i=1}^n λ_i 1{β_i ≠ 0}

The solution is given by hard-thresholding y,

    β_i = y_i  if y_i^2 > λ_i,   β_i = 0  otherwise,   i = 1, ..., n

and can be seen by inspection. In the special case λ_i = λ, i = 1, ..., n, this is

    min_{β ∈ R^n}  ||y − β||_2^2 + λ ||β||_0

Compare to soft-thresholding, the prox operator for the l_1 penalty. Note: changing the loss to ||y − Xβ||_2^2 gives best subset selection, which is NP hard for general X.
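A minimal numpy sketch (made-up vector y) comparing hard-thresholding, the solution above, with soft-thresholding, the l_1 prox, for contrast:

```python
import numpy as np

def hard_threshold(y, lam):
    """Solution of min_beta ||y - beta||_2^2 + lam * ||beta||_0: keep y_i iff y_i^2 > lam."""
    return np.where(y ** 2 > lam, y, 0.0)

def soft_threshold(y, t):
    """Soft-thresholding at level t (the prox of the l_1 penalty), for comparison."""
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

y = np.array([2.0, -0.3, 0.9, -1.5, 0.1])
print(hard_threshold(y, lam=1.0))   # [ 2.   0.   0.  -1.5  0. ]  (0.9^2 <= 1, so zeroed)
print(soft_threshold(y, t=1.0))     # [ 1.   0.   0.  -0.5  0. ]
```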

l_0 segmentation. Consider the nonconvex l_0 segmentation problem

    min_{β ∈ R^n}  sum_{i=1}^n (y_i − β_i)^2 + λ sum_{i=1}^{n−1} 1{β_i ≠ β_{i+1}}

This can be solved exactly using dynamic programming, in two ways: Bellman (1961), "On the approximation of curves by line segments using dynamic programming", and Johnson (2013), "A dynamic programming algorithm for the fused lasso and L0-segmentation". Johnson's approach is more efficient, Bellman's more general. Worst-case O(n^2), but with practical performance more like O(n).

Tree-leaves projection. Given a target u ∈ R^n, a tree g on R^n, and a label y ∈ {0, 1}, consider

    min_{z ∈ R^n}  ||u − z||_2^2 + λ 1{g(z) ≠ y}

Interpretation: find z close to u whose label under g is not unlike y. Argue directly that the solution is either ẑ = u or ẑ = P_S(u), where

    S = g^{-1}(y) = {z : g(z) = y},

the set of leaves of g assigned label y. We simply compute both options for ẑ and compare costs. Therefore the problem reduces to computing P_S(u), the projection onto a set of tree leaves, a highly nonconvex set.

This appears as a subroutine of a broader algorithm for nonconvex optimization; see Carreira-Perpinan and Wang (2012), "Distributed optimization of deeply nested systems".

The set S is a union of axis-aligned boxes; projection onto any one box is fast: O(n) operations.
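A minimal sketch of the brute-force version (hypothetical boxes given as lower/upper corner arrays; this is the plain scan, not the pruned tree search described on the next slide): project u onto each axis-aligned box by clipping, then keep the closest.

```python
import numpy as np

def project_onto_box(u, lo, hi):
    """Projection onto an axis-aligned box {z : lo <= z <= hi} is coordinate-wise clipping."""
    return np.clip(u, lo, hi)

def project_onto_union_of_boxes(u, boxes):
    """Scan all boxes (each a (lo, hi) pair) and return the closest projection."""
    best, best_dist = None, np.inf
    for lo, hi in boxes:
        z = project_onto_box(u, lo, hi)
        d = np.sum((u - z) ** 2)
        if d < best_dist:
            best, best_dist = z, d
    return best

# Made-up example in R^2: two boxes, one target point u
boxes = [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),
         (np.array([3.0, 3.0]), np.array([4.0, 5.0]))]
u = np.array([2.0, 2.5])
print(project_onto_union_of_boxes(u, boxes))   # [3. 3.], the projection onto the nearer box
```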

To project onto S, we could just scan through all the boxes and take the closest. Faster: decorate each node of the tree with the labels of its leaves and a bounding box. Perform a depth-first search, pruning nodes that do not contain a leaf labeled y, or whose bounding box is farther away than the current closest box.

Discrete problems

Binary graph segmentation. Given y ∈ R^n and a graph G = (V, E), V = {1, ..., n}, consider binary graph segmentation:

    min_{β ∈ {0,1}^n}  sum_{i=1}^n (y_i − β_i)^2 + sum_{(i,j) ∈ E} λ_ij 1{β_i ≠ β_j}

Simple manipulation brings this problem to the form

    max_{A ⊆ {1,...,n}}  sum_{i ∈ A} a_i + sum_{j ∈ A^c} b_j − sum_{(i,j) ∈ E, |A ∩ {i,j}| = 1} λ_ij

which is a segmentation problem that can be solved exactly using min cut/max flow. E.g., see Kleinberg and Tardos (2005), "Algorithm Design", Chapter 7.

E.g., apply this recursively to get a version of (divisive) graph hierarchical clustering. E.g., take the graph to be a 2d grid for image segmentation. (Image from http://ailab.snu.ac.kr)

Discrete l_0 segmentation. Now consider discrete l_0 segmentation:

    min_{β ∈ {b_1, ..., b_k}^n}  sum_{i=1}^n (y_i − β_i)^2 + λ sum_{i=1}^{n−1} 1{β_i ≠ β_{i+1}}

where {b_1, ..., b_k} is some fixed discrete set. This can be efficiently solved using classic (discrete) dynamic programming. The key insight is that the 1-dimensional structure allows us to exactly solve and store

    β̂_1(β_2) = argmin_{β_1 ∈ {b_1,...,b_k}}  (y_1 − β_1)^2 + λ 1{β_1 ≠ β_2}      (call this criterion f_1(β_1, β_2))

    β̂_2(β_3) = argmin_{β_2 ∈ {b_1,...,b_k}}  f_1(β̂_1(β_2), β_2) + (y_2 − β_2)^2 + λ 1{β_2 ≠ β_3}

    ...

Algorithm (see the sketch below):
- Make a forward pass over β_1, ..., β_{n−1}, keeping a look-up table; also keep a look-up table for the optimal partial criterion values f_1, ..., f_{n−1}
- Solve exactly for β_n
- Make a backward pass over β_{n−1}, ..., β_1, reading off the look-up table

[Figure: the two look-up tables, with one column per β_1, ..., β_{n−1} (resp. f_1, ..., f_{n−1}) and one row per candidate value b_1, ..., b_k.]

Requires O(nk) operations.
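A minimal sketch of this dynamic program (my own implementation, not Bellman's or Johnson's code), using the fact that the best predecessor level is either the same level (no penalty) or the overall cheapest level plus λ, which keeps the pass at O(nk):

```python
import numpy as np

def discrete_l0_segmentation(y, b, lam):
    """Exact DP for min sum_i (y_i - beta_i)^2 + lam * sum_i 1{beta_i != beta_{i+1}},
    with each beta_i restricted to the discrete set b."""
    y, b = np.asarray(y, float), np.asarray(b, float)
    n, k = len(y), len(b)
    cost = np.empty((n, k))                       # cost[i, j]: best partial cost with beta_i = b[j]
    argmin_prev = np.zeros((n, k), dtype=int)     # back-pointers for the backward pass

    cost[0] = (y[0] - b) ** 2
    for i in range(1, n):
        best_j = int(np.argmin(cost[i - 1]))
        switch_cost = cost[i - 1, best_j] + lam   # best predecessor if we change level
        stay_cost = cost[i - 1]                   # predecessor if we keep the same level
        take_stay = stay_cost <= switch_cost
        cost[i] = (y[i] - b) ** 2 + np.where(take_stay, stay_cost, switch_cost)
        argmin_prev[i] = np.where(take_stay, np.arange(k), best_j)

    # Backward pass: read off the optimal levels from the look-up table
    beta_idx = np.empty(n, dtype=int)
    beta_idx[-1] = int(np.argmin(cost[-1]))
    for i in range(n - 1, 0, -1):
        beta_idx[i - 1] = argmin_prev[i, beta_idx[i]]
    return b[beta_idx]

y = np.array([0.1, 0.0, 0.2, 1.1, 0.9, 1.0])
print(discrete_l0_segmentation(y, b=[0.0, 1.0], lam=0.5))   # [0. 0. 0. 1. 1. 1.]
```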

Infinite-dimensional problems

Smoothing splines. Given pairs (x_i, y_i) ∈ R × R, i = 1, ..., n, smoothing splines solve

    min_f  sum_{i=1}^n (y_i − f(x_i))^2 + λ ∫ (f^{((k+1)/2)}(t))^2 dt

for a fixed odd k. The domain of minimization here is all functions f for which ∫ (f^{((k+1)/2)}(t))^2 dt < ∞. This is an infinite-dimensional problem, but convex (in function space).

One can show that the solution f̂ to the above problem is unique, and given by a natural spline of order k, with knots at x_1, ..., x_n. This means we can restrict our attention to functions

    f = sum_{j=1}^n θ_j η_j

where η_1, ..., η_n are natural spline basis functions.

Plugging in f = sum_{j=1}^n θ_j η_j transforms the smoothing spline problem into the finite-dimensional form

    min_{θ ∈ R^n}  ||y − Nθ||_2^2 + λ θ^T Ω θ

where N_ij = η_j(x_i), and Ω_ij = ∫ η_i^{((k+1)/2)}(t) η_j^{((k+1)/2)}(t) dt. The solution is explicitly given by

    θ̂ = (N^T N + λΩ)^{-1} N^T y

and the fitted function is f̂ = sum_{j=1}^n θ̂_j η_j. With a proper choice of basis functions (B-splines), calculation of θ̂ is O(n).

See, e.g., Wahba (1990), "Spline Models for Observational Data"; Green and Silverman (1994), "Nonparametric Regression and Generalized Linear Models"; Hastie et al. (2009), Chapter 5.
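A minimal sketch of the finite-dimensional solve. Here N and Ω are stand-ins built from a made-up basis (Gaussian bumps centered at the inputs, identity penalty), purely to show the linear algebra; a real smoothing spline would use natural spline basis functions and the penalty matrix of integrated derivative products.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)
lam = 1e-3

# Stand-in basis matrix N (N_ij = eta_j(x_i)) and penalty matrix Omega, for illustration only
N = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.1 ** 2))
Omega = np.eye(n)

# theta_hat = (N^T N + lam * Omega)^{-1} N^T y, solved without forming the inverse
theta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
fitted = N @ theta_hat          # fitted values f_hat(x_i)
print(np.round(fitted[:5], 3))
```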

Locally adaptive regression splines. Given the same setup, locally adaptive regression splines solve

    min_f  sum_{i=1}^n (y_i − f(x_i))^2 + λ TV(f^{(k)})

for fixed k, even or odd. The domain is all f with TV(f^{(k)}) < ∞, and again this is infinite-dimensional but convex.

Again, one can show that a solution f̂ to the above problem is given by a spline of order k, but with two key differences:
- It can have any number of knots ≤ n − k − 1 (tuned by λ)
- The knots do not necessarily coincide with the input points x_1, ..., x_n

See Mammen and van de Geer (1997), "Locally adaptive regression splines"; in short, these are statistically more adaptive but computationally more challenging than smoothing splines.

Mammen and van de Geer (1997) consider restricting attention to splines with knots contained in {x_1, ..., x_n}; this turns the problem into the finite-dimensional form

    min_{θ ∈ R^n}  ||y − Gθ||_2^2 + λ sum_{j=k+2}^n |θ_j|

where G_ij = g_j(x_i), and g_1, ..., g_n is a basis for splines with knots at x_1, ..., x_n. The fitted function is f̂ = sum_{j=1}^n θ̂_j g_j.

These authors prove that the solution f̂ of this (tractable) problem and the solution f* of the original problem differ by

    max_{x ∈ [x_1, x_n]} |f̂(x) − f*(x)| ≤ d_k · TV((f*)^{(k)}) · Δ^k,

with Δ the maximum gap between inputs. Therefore, statistically it is reasonable to solve the finite-dimensional problem.

E.g., a comparison, tuned to the same overall model complexity: [Figure: left panel, a smoothing spline fit; right panel, a finite-dimensional locally adaptive regression spline fit.] The left fit is easier to compute, but the right is more adaptive. (Note: trend filtering estimates are asymptotically equivalent to locally adaptive regression splines, but much cheaper to compute.)

Statistical problems

Sparse underdetermined linear systems. Suppose that X ∈ R^{n×p} has unit-normed columns, ||X_i||_2 = 1 for i = 1, ..., p. Given y, consider the problem of finding the sparsest solution to the linear system:

    min_{β ∈ R^p}  ||β||_0
    subject to     Xβ = y

This is nonconvex and known to be NP hard for a generic X. A natural convex relaxation is the l_1 basis pursuit problem:

    min_{β ∈ R^p}  ||β||_1
    subject to     Xβ = y

It turns out that there is a deep connection between the two; we cite results from Donoho (2006), "For most large underdetermined systems of linear equations, the minimal l_1 norm solution is also the sparsest solution".

As n, p grow large with p > n, there exists a threshold ρ (depending on the ratio p/n) such that, for most matrices X, if we solve the l_1 problem and find a solution with:
- fewer than ρn nonzero components, then this is the unique solution of the l_0 problem;
- greater than ρn nonzero components, then there is no solution of the linear system with fewer than ρn nonzero components.

(Here "most" is quantified precisely in terms of a probability over matrices X, constructed by drawing the columns of X uniformly at random over the unit sphere in R^n.)

There is a large and fast-moving body of related literature. See Donoho et al. (2009), "Message-passing algorithms for compressed sensing", for a nice review.
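A minimal sketch of the l_1 basis pursuit relaxation (made-up sparse instance; not from the slides), written as an LP over the positive and negative parts of β and solved with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
n, p, s = 20, 60, 3
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)            # unit-normed columns
beta_true = np.zeros(p)
beta_true[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
y = X @ beta_true

# min ||beta||_1 s.t. X beta = y, with beta = bp - bm, bp, bm >= 0
c = np.ones(2 * p)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * p))
beta_hat = res.x[:p] - res.x[p:]

print(np.count_nonzero(np.abs(beta_hat) > 1e-8))   # typically recovers the s-sparse solution
print(np.allclose(X @ beta_hat, y, atol=1e-6))
```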

Nearly optimal K-means. Given data points x_1, ..., x_n ∈ R^p, the K-means problem solves

    min_{c_1, ..., c_K ∈ R^p}  (1/n) sum_{i=1}^n min_{k=1,...,K} ||x_i − c_k||_2^2      (call this f(c_1, ..., c_K))

This is NP hard, and is usually approximately solved using Lloyd's algorithm, run many times with random starts.

Careful choice of starting positions makes a big impact: running Lloyd's algorithm once, from c_1 = s_1, ..., c_K = s_K, for cleverly chosen random s_1, ..., s_K, yields estimates ĉ_1, ..., ĉ_K satisfying

    E[ f(ĉ_1, ..., ĉ_K) ] ≤ 8 (log K + 2) · min_{c_1, ..., c_K ∈ R^p} f(c_1, ..., c_K)

See Arthur and Vassilvitskii (2007), "k-means++: The advantages of careful seeding". In fact, their construction of s_1, ..., s_K is very simple (see the sketch below):
- Begin by choosing s_1 uniformly at random among x_1, ..., x_n
- Compute squared distances d_i^2 = ||x_i − s_1||_2^2 for all points i not chosen, and choose s_2 by drawing from the remaining points, with probability weights d_i^2 / sum_j d_j^2
- Recompute the squared distances as d_i^2 = min{ ||x_i − s_1||_2^2, ||x_i − s_2||_2^2 } and choose s_3 according to the same recipe
- And so on, until s_1, ..., s_K are chosen
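A minimal numpy sketch of the k-means++ seeding recipe itself (a small synthetic dataset; this is just the seeding step, and Lloyd's algorithm would be run from these starts):

```python
import numpy as np

def kmeans_pp_seeds(X, K, rng):
    """k-means++ seeding: each new seed is drawn with probability proportional to its
    current squared distance to the nearest already-chosen seed."""
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]                 # s_1 uniform among the data points
    d2 = np.sum((X - seeds[0]) ** 2, axis=1)     # squared distances to nearest seed
    for _ in range(1, K):
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)             # draw s_k with weights d_i^2 / sum_j d_j^2
        seeds.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - seeds[-1]) ** 2, axis=1))
    return np.array(seeds)

# Made-up data: three well-separated clusters in R^2
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [5, 5], [0, 5])])
print(kmeans_pp_seeds(X, K=3, rng=rng))          # usually lands one seed near each cluster
```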