More Constraints, Smaller Coresets: Constrained Matrix Approximation of Sparse Big Data
Dan Feldman, Computer Science Department, University of Haifa, Haifa, Israel
Tamir Tassa, Department of Mathematics and Computer Science, The Open University, Ra'anana, Israel
KDD '15, August 10-13, 2015, Sydney, NSW, Australia.

ABSTRACT
We suggest a generic data reduction technique with provable guarantees for computing the low rank approximation of a matrix under some ℓ_z error, and constrained factorizations, such as the Non-negative Matrix Factorization (NMF). Our main algorithm reduces a given n × d matrix into a small, ε-dependent, weighted subset C of its rows (known as a coreset), whose size is independent of both n and d. We then prove that applying existing algorithms on the resulting coreset can be turned into (1 + ε)-approximations for the original (large) input matrix. In particular, we provide the first linear time approximation scheme (LTAS) for the rank-one NMF. The coreset C can be computed in parallel and using only one pass over a possibly unbounded stream of row vectors. In this sense we improve the result in [4] (Best Paper of STOC 2013). Moreover, since C is a subset of these rows, its construction time, as well as its sparsity (number of nonzero entries) and the sparsity of the resulting low rank approximation, depend on the maximum sparsity of an input row, and not on the actual dimension d. In this sense, we improve the result of Liberty [21] (Best Paper of KDD 2013) and answer affirmatively, and in a more general setting, his open question of computing such a coreset. We implemented our coreset and demonstrate it by turning Matlab's NMF off-line function, which gets a matrix in the memory of a single machine, into a streaming algorithm that runs in parallel on 64 machines on Amazon's cloud and returns a sparse NMF factorization. Source code is provided for reproducing the experiments and for integration with existing and future algorithms.

1. INTRODUCTION
Matrix factorization is a fundamental problem that was independently introduced in different contexts and applications. In Latent Semantic Analysis (LSA) or Probabilistic LSA (PLSA), the rows of an input matrix A correspond to documents in a large corpus of data, its columns correspond to terms that may appear in those documents, and the (i, j)th entry in A denotes the frequency of the jth term in the ith document (or other nonnegative statistics such as tf-idf). Similarly, in computer vision applications the (i, j)th entry in A may represent the frequency of the ith feature in the jth image, and in social networks A may be the adjacency matrix of the graph. An approximate k-rank factorization such as the thin Singular Value Decomposition (SVD) of A enables extracting the k most important topics that are represented in that data and thus obtaining a concise representation of the data.
In nonnegative matrix approximation, each topic is a linear combination of terms (columns of A), with only nonnegative coefficients. Such techniques have been applied, for example, to image segmentation [20], information retrieval [15] and document clustering [26].

In this paper we deal with matrix factorization problems and their constrained variants. These problems can be described geometrically as computing a low-dimensional subspace S that minimizes the sum of distances, in some power z ≥ 1, to a given set P of n points in a d-dimensional Euclidean space, Σ_{p∈P} dist(p, S)^z, where dist(p, S) := min_{x∈S} ‖p − x‖₂. The input points are the rows of an n × d matrix A, and their projections on the subspace are the low rank approximation.

1.1 Problem Statement
Optimal (k, z)-subspace. For a given matrix A ∈ R^{n×d} and a pair of integers k, z ≥ 1, the optimal (k, z)-subspace of A is the k-dimensional subspace S∗ of R^d that minimizes the sum of distances to the power of z from the set of rows P = {p₁, …, p_n} of A, i.e., S∗ minimizes cost_z(P, S) := Σ_{p∈P} (dist(p, S))^z. A (1 + ε)-approximation to this optimal (k, z)-subspace is a k-subspace S that minimizes cost_z(P, S) up to a multiplicative factor of 1 + ε, i.e., cost_z(P, S) ≤ (1 + ε) cost_z(P, S∗).

Coreset. For a given ε > 0, a (k, z, ε)-coreset C for the matrix A is a subset of scaled rows from A such that, using only C (without A), we can compute a (1 + ε)-approximation to the optimal (k, z)-subspace of A. In particular, if C is small, we can apply a (possibly inefficient) algorithm on C to get a (1 + ε)-approximation for the optimal (k, z)-subspace of A. Our main motivation is to compute such small coresets for obtaining the optimal subspace of A also under some additional constraints, as in the NMF problem.
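For concreteness, the following Matlab sketch (ours, not part of the paper's released code) evaluates cost_z(P, S) when the candidate subspace S is represented by a d × k matrix V whose orthonormal columns span it; for the unconstrained optimum with z = 2, such a basis can be taken from the top k right singular vectors of A.

% Sketch only: cost_z(P, S) = sum over rows p of dist(p, S)^z, where the
% k-dimensional subspace S (through the origin) is spanned by the
% orthonormal columns of V (d x k).
function c = cost_z(P, V, z)
    proj  = (P * V) * V';                 % projection of every row onto the subspace
    dists = sqrt(sum((P - proj).^2, 2));  % Euclidean distance of every row to the subspace
    c = sum(dists .^ z);
end

% Example (unconstrained case): the optimal (k,2)-subspace of A is spanned by
% the top k right singular vectors, so
%   [~, ~, V] = svds(A, k);  c = cost_z(A, V, 2);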
Non-Negative Matrix Factorization (NMF). In one of the variants of the Non-Negative Matrix Factorization problem, we wish to project the rows of A on a line ℓ that minimizes their sum of distances to this line, with the additional constraint that ℓ is spanned by a vector u whose entries are non-negative, i.e., ℓ = Sp(u) and u ∈ L₀, where L₀ = {(x₁, …, x_d) ∈ R^d : x₁, …, x_d ≥ 0} is the non-negative orthant of R^d. If the rows of A have only non-negative entries (as is common in many applications), then their projections on ℓ will also be non-negative. This results in a factorization au^T ∈ R^{n×d} that approximates A, where a and u are non-negative column vectors.

(L, k, z)-subspace optimization. A natural generalization of NMF is the (L, k, z)-subspace optimization problem: given a convex cone L in R^d, one looks for a k-dimensional subspace S∗_L(P) that minimizes cost_z(P, S) over every k-dimensional subspace S that is spanned by k vectors in L. In the special case k = 1 we denote S∗_L(P) by ℓ∗_L(P). A (1 + ε)-approximation to the optimal (L, k, z)-subspace is a subspace S that is spanned by k vectors in L such that cost_z(P, S) ≤ (1 + ε) cost_z(P, S∗_L(P)).

1.2 Related Work
Coresets. Following a decade of research, coresets (i.e., subsets of rows from the input matrix) that approximate any subspace in R^d were suggested in [6, 24] for sums of distances to the power of z ≥ 1. However, their size is |C| = O(dk^{O(z)}/ε²), i.e., polynomial in d, which makes them less suitable for common large matrices (documents-terms, images-features, and adjacency matrices of graphs), where d may be as large as n or even larger. The dependency on d is due to the complexity, known as the pseudo-dimension, of this family of subspaces. Our work here is based on these results, but we show how to remove the d factor by reducing the family of subspaces and using a post-processing algorithm for extracting the desired subspace from the coreset. For the case z = 2, a deterministic coreset construction of size independent of d was very recently suggested in [11]. However, that construction is heavily tailored for this case of squared distances and the SVD, unlike the above constructions and the construction in our paper.

To our knowledge, the only coreset for constrained factorization was given in [3] by Boutsidis and Drineas. They concentrated on the regression problem of minimizing ‖Ax − b‖ over positive x ∈ R^d. They used a coreset of size roughly d log(n)/ε² to approximate this problem, and thus it is useful only if d ≪ n. In addition, the coreset is proved to approximate only the optimal solution, and it is not clear how to apply it in the parallel or streaming setting without the required guarantee for merging a pair of coresets.

It was proved in [23] that for L = R^d and every z ≥ 1, the optimal (L, k, z)-subspace is spanned by poly(k, 1/ε) input rows, using a non-constructive proof of existence. Our work generalizes this result to the constrained subspace case, and implies efficient constructions of these spanning sets. Other, weaker versions of coresets appear in [7] in the context of k-means clustering.

Sketches. A sketch in our context is a set of vectors in R^d that can be used to compute a (1 + ε)-approximation to the optimal subspace of a given matrix. Unlike coresets, if the input matrix is sparse, the sketch is in general not sparse. A sketch of cardinality d that approximates the sum of squared distances (z = 2) to any subspace in R^d can be constructed with no approximation error (ε = 0) using the d rows of the matrix DV^T, where UDV^T = A is the thin SVD of A. It was proved in [10] that taking the first O(k/ε) rows of DV^T yields such a sketch.
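As an illustration of this kind of sketch (our reconstruction; the choice m = ⌈k/ε⌉ simply instantiates the asymptotic bound above, not a tuned constant), the following Matlab fragment builds the first m rows of DV^T from the thin SVD:

% Illustrative only: a dense sketch B whose m rows approximate the sum of
% squared distances of the rows of A to every k-subspace (z = 2), cf. [10].
[~, D, V] = svd(full(A), 'econ');          % thin SVD A = U*D*V'
m = min(ceil(k / epsilon), size(A, 2));    % assumed choice, m = O(k/epsilon)
B = D(1:m, 1:m) * V(:, 1:m)';              % m x d sketch: first m rows of D*V'

Note that, unlike the coresets constructed in this paper, the rows of B are dense linear combinations of the input rows, so the sparsity of A is lost.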
This sketch can be constructed in both the streaming and parallel models by replacing ε with ε/log n in the time and memory requirements. In [21] (Best Paper of KDD 2013) the above log n factor was removed in the streaming setting, as shown by the analysis of [13]. In [21, 13] the open question of whether we can construct such coresets (i.e., subsets of the input rows) of size independent of d was left open. In this paper we answer this question affirmatively, also for z ≥ 1 and constrained optimization. Sketches for z ≠ 2, of size polynomial in d, were suggested in [9, 5]. The first sketch that can be constructed in input sparsity time was suggested in [4] (Best Paper of STOC 2013). However, that sketch is dense and thus not applicable for merge-reduce techniques, and cannot be computed in parallel or over streaming big data.

Non-negative Matrix Factorization (NMF). Unlike the optimal (unconstrained) subspace problem, Vavasis [25] has shown that for general values of k, NMF is NP-hard, even for z = 2 and L = L₀. In fact, there are almost no provable results in the study of the NMF problem. A notable exception is the paper by Arora et al. [1], which describes a solution under assumptions that may be hard to verify in practice and runs in time 2^{(log(1/ε))^{O(1)}} · poly(n, d), where poly(n, d) is some implicit polynomial in n and d, for k = 1. We suggest a PTAS that takes time O(nd²) + O(d · 2^{1/ε^{O(1)}}) for this case and makes no assumptions on the input. For the case k > 1, running the algorithm in [1] on our corresponding coreset would improve their running time to O(nd²) for constants k and ε.

To our knowledge, all other papers that studied the NMF problem proposed heuristics for solving it, based on greedy and rank-one downdating approaches, e.g. [2, 15, 16, 17, 18, 19, 20]. In the downdating approach, similarly to the recursive SVD algorithm for computing the (unconstrained) k-rank approximation, existing heuristics solve the (L, k, z)-subspace optimization problem by applying a solution to the optimal (L, 1, z)-subspace (line) in k iterations: one starts by solving the problem for k = 1 and, using the obtained solution, the algorithm then proceeds to update the input matrix and then solve another 1-dimensional problem on the updated matrix. All these off-line algorithms can now be applied on our small coresets to boost both their running time and performance. For example, these heuristics usually run many iterations from an initial guess (seed) until they converge to a local minimum, and then repeat on other seeds to obtain possibly better local minima. Running them on the coresets allows using more iterations and seeds in less time, which in turn might improve the quality of the result. We can also run such off-line algorithms on streaming data and in parallel, by applying them on the coresets for the data seen so far, as explained in [10].

1.3 Our results
Coresets for low-rank approximation. We prove that for every given matrix A ∈ R^{n×d}, an error parameter ε > 0, and an integer z ≥ 1, there is a (k, z, ε)-coreset C of size k^{O(z)} log(1/ε)/ε^{O(1)}.
In particular, unlike previous results, the size of the coreset C is independent of the input matrix size, d and n, and thus its sparsity depends on the maximum sparsity of an input row and not on the size of A. Such a result was previously known only for the special case z = 2 [11]. In order to extract the approximated optimal subspace of A from the coreset C, one needs to compute the optimal subspace S of C. This subspace is not necessarily a good approximation to the optimal subspace of A. However, applying our post-processing algorithm with C, S and L = R^d as its input yields the desired approximation (see Algorithm 1).

Coresets for constrained optimization. As in most of the existing algorithms for NMF (see related work), we use an algorithm for solving the case k = 1 in order to solve the more general (L, k, z)-subspace optimization for any k > 1. We prove that any such algorithm that computes a (1 + ε)-approximation for the (L, 1, z)-subspace of a given matrix A can be applied on our coreset C to obtain a (1 + ε)-approximation ℓ to the optimal (L, 1, z)-subspace of A. This property of the coreset holds simultaneously for every cone L of R^d. The (L, 1, z)-subspace (line) ℓ is computed by (i) applying the given approximation algorithm on the coreset C to get an output line ℓ′, and then (ii) rotating ℓ′ for the given tuple (C, L, ε, ℓ′) using our post-processing algorithm, which returns the final approximated line ℓ; see Algorithm 1.

LTAS for constrained factorization. Unfortunately, although C is small, we could not find any algorithm that runs in time polynomial in the input and provably computes the optimal subspace of C or its approximation, even for the case of Non-Negative Matrix Factorization where k = 1 and L = L₀. We thus design the first linear time approximation scheme (LTAS) for computing such a (1 + ε)-approximation ℓ to the optimal (L, 1, z)-subspace, for any given cone L and z ≥ 1. For the case z = 2 its running time is O(nd²) + O(d · 2^{1/ε^{O(1)}}) when applied on our coreset. Unlike previous results, which apply only for the Frobenius norm (z = 2) and non-negativity constraints, our coreset and LTAS can be computed for a more general family of constraints and any z ≥ 1 to obtain a sparse approximation.

Streaming and parallel coreset construction. Via traditional map-reduce or merge-and-reduce techniques, our coresets can be computed using one pass over a possibly unbounded stream of rows, and in parallel over M machines (e.g. GPUs, a network, smartphones or a cloud). The size of the coreset and the required memory, as well as the update time per point, are similar to the off-line case, where ε is replaced by ε/log n and n is the number of rows seen so far; see [14, 10] for details. Using M machines in parallel, the running time is reduced by a factor of M.

Experimental results. We implemented both our coreset construction algorithm and the post-processing algorithm in Matlab. We then demonstrate how to boost Matlab's native off-line solver for NMF by applying it on our coreset and then using the post-processing algorithm on the resulting subspace to obtain an approximate solution for the original data. Note that we boost Matlab's solver only for demonstration: our coreset and post-processing algorithm can be applied on any existing or future heuristic that aims to compute an approximation for a constrained factorization such as the NMF. Only the call to the Matlab function should be replaced in our test code.
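The intended workflow, in a hypothetical Matlab form (the names coreset_rows and rotate_line below are placeholders for Algorithms 3 and 1 from Sections 4 and 3; only nnmf is an actual Matlab function), is roughly:

C = coreset_rows(A, epsilon, delta);     % Algorithm 3: small weighted subset of scaled rows
[~, H] = nnmf(C, 1);                     % any off-line NMF heuristic, applied to C only
                                         % (C is nonnegative whenever A is, since its rows
                                         %  are nonnegative scalings of rows of A)
u = H(1, :)' / norm(H(1, :));            % unit direction of the heuristic's line
L0 = 'nonnegative orthant';              % placeholder representation of the constraint cone
l = rotate_line(C, L0, epsilon, u);      % Algorithm 1: post-processing rotation into the cone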
Although Matlab's NMF solver does not support parallel or streaming computation, by combining it with our coreset and the merge-and-reduce technique we were able to compute the NMF of a few datasets on 64 machines on Amazon's cloud. On small datasets we obtained an even better approximation compared to its result on the original (complete) data. While this phenomenon cannot happen when computing optimal (exact) solutions, it commonly occurs in the context of coresets.

Open source code. Open source code can be found in [12] for reproducing the experiments and applying the coreset for boosting existing and future algorithms.

1.4 Overview and organization
In Section 2 we provide the necessary background on coresets and give a bird's-eye view of our approach. In Section 3 we prove our main technical result: for the optimal constrained line (the (L, 1, z)-subspace) that is spanned by L, there is always an approximating line that is spanned by few vectors in the union of L and the input points. In Section 4 we show the coreset construction, which is similar to previous constructions except that its size is smaller and independent of d, and it approximates the constrained optimal subspace rather than every subspace. Using exhaustive search to compute the optimal solution from the coreset, we then obtain the first LTAS for the NMF problem. In Section 5 we demonstrate the efficiency of our algorithm and evaluate the quality of the results that it issues by replacing the LTAS with an existing NMF heuristic. We conclude in Section 6. Due to space limitations, the proofs of most of the claims are omitted; they will be provided in the full version of this paper.

2. CORESETS
2.1 A General Framework for Coresets
In this section we summarize our main technical result and the new approach for computing coresets for constrained optimization problems. It is based on the general framework for coreset constructions from [6]. That framework assumes that we have a set P of n items, called the input set, a (usually infinite) set Q, called the family of queries, and a function dist that assigns to every pair p ∈ P and q ∈ Q a non-negative real number. For example, in the optimal (1, 1)-subspace problem, P is a set of n points in R^d (the rows of an n × d matrix), Q is the set of lines in R^d, and dist(p, ℓ) is the distance from p ∈ P to the line ℓ ∈ Q. Given an error parameter ε ∈ (0, 1), the main algorithm of that framework outputs a pair (S, w), called an ε-coreset, which consists of a subset S ⊆ P and a weight function w : S → [0, ∞) over the points in S, such that the following holds. For every query q ∈ Q, the sum of distances from P to q is approximated by the sum of weighted distances from S to q, up to a 1 ± ε factor, i.e.,

(1 − ε) Σ_{p∈P} dist(p, q) ≤ Σ_{p∈S} w(p) dist(p, q) ≤ (1 + ε) Σ_{p∈P} dist(p, q).

It is proved in [6] that such a coreset (S, w) can be constructed by taking a non-uniform sample from P, where each sample s ∈ S is chosen to be p ∈ P with probability that is proportional to its sensitivity (sometimes called importance or leverage score),

σ(p) := max_{q∈Q} dist(p, q) / Σ_{p′∈P} dist(p′, q).
Intuitively, if there is a query q ∈ Q for which the distance from p to q dominates the sum of distances, then it is more important to select p to the coreset. Of course, the sensitivity of a point is at most 1, but we hope that not all points are important. The assigned weight for a sampled point s = p is inversely proportional to σ(p). It is then proved that, given a probability of failure δ ∈ (0, 1), this sampling procedure yields an ε-coreset (S, w) with probability at least 1 − δ, where the size of S is |S| = O(Σ²/ε²)(v + log(1/δ)). Here, Σ := Σ_{p∈P} σ(p) is called the total sensitivity and v is called the pseudo-dimension of the pair (P, Q), which represents the complexity of this pair in a VC-dimension-type sense that is defined formally in [6]. Usually (but not always), v is the number of parameters needed to define a query. For example, the family of lines in R^d has pseudo-dimension O(d); indeed, every such line can be defined by O(d) parameters that represent its O(d) degrees of freedom (direction and translation from the origin).

2.2 Weak Coresets
To obtain a coreset for the optimal (k, z)-subspace problem, we let P be the input set of n points in R^d, Q be the set (family) of k-subspaces in R^d, and dist_z(p, X) = min_{x∈X} ‖p − x‖^z, where p ∈ P and X ∈ Q. It was proved in [24] that the total sensitivity for this setting is independent of both n and d (see the proof of Theorem 4.1 for details), but its pseudo-dimension is O(d). Note that in order to compute an approximation to the optimal (k, z)-subspace for P using the coreset, we only need the optimal (k, z)-subspace for the coreset to be a good approximation to the optimal (k, z)-subspace for P. In this sense, the requirement made in [24] that an ε-coreset must approximate all subspaces in Q is too strong. In order to reduce the pseudo-dimension, a generalized version of the pseudo-dimension was suggested in [6]. It was used there to construct what is sometimes called a weak coreset [7, 8]. The resulting coreset C is weak in the sense that instead of approximating every query, it approximates only a subset of queries Q(C) ⊆ Q that depends on the output coreset C. The generalized pseudo-dimension is then defined in [6] for such a function Q that assigns a subset Q(S) ⊆ Q to every subset S ⊆ P.

2.3 Novel approach: more constraints, smaller coresets
Variants of weak coresets for subspace approximation were used in [8, 6]. However, the size of the coreset in [8] was exponential in k and, furthermore, the original input set P was needed to extract the approximate subspace from the coreset. In [6] the coreset was based on recursively projecting the points on their (k′, z)-optimal subspace, for all 1 ≤ k′ ≤ k, and thus the output was a sketch (and not a subset of P). In addition, those weak coresets cannot be used to approximate the constrained optimal subspace, because such an approximation may not be spanned by few input points. To this end, we prove that a (1 + ε)-approximation to the constrained (L, 1, z)-subspace is spanned by a small set which is not a subset of the input, but still depends on only O(2^{O(z)}/ε) input points and their projections on the cone L that represents the constraints. This implies a generalized pseudo-dimension of size independent of d for the constrained (L, k, z)-subspace problem. Based on the previous subsection and the bound on the total sensitivity from [24, 6], we thus obtain a coreset construction C of size independent of d.
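To make the sampling scheme of Section 2.1 concrete, here is a generic Matlab sketch of sensitivity sampling (ours, for illustration; it assumes the sensitivities, or upper bounds on them, are already given as a vector sigma):

% Sample t rows of P with probability proportional to sigma and return
% inverse-probability weights, as in the framework of [6].
function [S, w] = sensitivity_sample(P, sigma, t)
    pr  = sigma / sum(sigma);                    % sampling distribution
    idx = randsample(size(P, 1), t, true, pr);   % t i.i.d. draws, with replacement
    S   = P(idx, :);                             % the sampled (multi)set of rows
    w   = 1 ./ (t * pr(idx));                    % weights, so that weighted sums are unbiased
end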
While the construction is very similar to that in [24, 6] (for strong coresets), we cannot simply apply an existing approximation algorithm on C, since the optimal (L, k, z)-subspace S for C is not necessarily a good approximation to the optimal (L, k, z)-subspace for P. This is because S may not be spanned by few points, i.e., S ∉ Q(C). Instead, we need to rotate S to obtain another subspace S′ that (i) is a member of the set Q(C) of queries that are approximated well in P, and (ii) approximates the optimal (k, z)-subspace S of C. Indeed, our proof that there is a (1 + ε)-approximation to the constrained (L, k, z)-subspace is constructive. It gets as input an optimal subspace S that was computed for the small coreset C, and carefully and iteratively rotates it in order to obtain a subspace S′ that satisfies properties (i) and (ii) above. The subspace S can be computed using exhaustive search on C, which yields the first PTAS for the rank-1 NMF problem; see Theorem 4.2. Alternatively, we can apply existing heuristics to compute a k-subspace S that hopefully approximates the optimal (L, k, z)-subspace, and then apply our post-processing rotation; see Theorem 4.3.

For simplicity, in the next sections we focus on coresets for the case z = 2. The generalizations for z ≠ 2 can be obtained by using the results from [24] that bound the total sensitivity for these cases. The generalization of our post-processing algorithm for z ≠ 2 is also straightforward, using the generalization of the triangle inequality to the weak triangle inequality (or Hölder's inequality), as explained in the appendix of [23].

3. CONSTRAINED OPTIMIZATION
The constraints on the approximating subspaces have a geometric representation: the approximating subspace must have a basis of vectors that belong to some convex cone L in R^d. A convex cone is a collection of half-lines in R^d that start at the origin and form a convex subset of R^d; e.g., L₀ is a convex cone, and the corresponding constrained problem is NMF. See an illustration of a cone in Figure 1. Let L be a convex cone, P be a set of points in R^d, ℓ∗ = ℓ∗_L(P) be the solution to the (L, 1, 2)-subspace optimization problem, and ε > 0. In this section we construct a PTAS for finding a line ℓ that approximates ℓ∗ in the sense that

cost(P, ℓ) ≤ (1 + ε) cost(P, ℓ∗),   (1)

where hereinafter cost := cost_z for z = 2 (namely, cost(P, ℓ) = Σ_{p∈P} (dist(p, ℓ))²). First (Section 3.3) we show that any arbitrary line ℓ∗ in L (not necessarily ℓ∗ = ℓ∗_L(P)) has a line ℓ that approximates it in the sense of inequality (1) and is spanned by a sparse subset of P; the construction of that sparse subset and of ℓ uses ℓ∗ as an input. We then use this result in order to approximate the optimal line ℓ∗ = ℓ∗_L(P) even though it is not known upfront (Section 3.4).
We begin with some preliminary discussions in Sections 3.1 and 3.2. In what follows, when we speak of lines in R^d we refer only to subspaces, namely lines that pass through the origin o of R^d.

3.1 Projection of lines
An important notion in our analysis is that of a projection of a line onto a convex cone.

Definition 3.1 (L-projection). For any two lines ℓ₁ and ℓ₂ in R^d, if α is the angle between them, then β(ℓ₁, ℓ₂) := sin α. A function η(·) is called an L-projection if it gets a line ℓ and returns a line η(ℓ) ∈ L such that β(η(ℓ), ℓ̂) ≤ β(ℓ, ℓ̂) for all ℓ̂ ∈ L.

See in Figure 1 an example of a line ℓ, its projection η(ℓ) on the depicted cone, and the angle between η(ℓ) and ℓ, the sine of which is denoted by β(η(ℓ), ℓ).

Lemma 3.2. Let L be a convex cone. Let η(·) be a function that gets a line ℓ in R^d and returns a line ℓ′ = η(ℓ) ∈ L that minimizes β(ℓ′, ℓ) over L, where ties are broken arbitrarily. Then η is an L-projection.

Figure 1: A cone and the projection of a line on the cone (Definition 3.1).

Lemma 3.2 can be used to compute an L-projection η for a given convex cone L. In what follows, we assume that η(ℓ) is computable in time O(d) for any line ℓ in R^d.

3.2 Centroid sets
The following definitions play a central role in the analysis that follows.

Definition 3.3. Let ℓ and ℓ′ be lines in R^d and let 0 < ε ≤ 1. Denote by α and π − α the two angles between ℓ and ℓ′. Let F_α(ℓ, ℓ′, ε) (resp. F_{π−α}(ℓ, ℓ′, ε)) denote the set of ⌈1/ε⌉ + 1 lines in R^d that includes ℓ and ℓ′ and partitions the angle α (resp. π − α) into ⌈1/ε⌉ equal parts. Then G(ℓ, ℓ′, ε) := F_α(ℓ, ℓ′, ε) ∪ F_{π−α}(ℓ, ℓ′, ε).

An illustration of the fan of lines G(ℓ, ℓ′, ε) is given in Figure 2.

Figure 2: An illustration of the fan of lines defined in Definition 3.3.

Definition 3.4 (Centroid Set). Let Q = (q₁, …, q_m) be a sequence of m points from R^d \ {o}, L be a convex cone, η be an L-projection, and ε > 0. Set G₁ = {Sp(q₁)} and then, for every 1 ≤ i ≤ m − 1, define G_{i+1} = ∪_{ℓᵢ∈Gᵢ} G(η(ℓᵢ), Sp(q_{i+1}), ε). Then Γ(Q, η, ε) := {η(ℓ) : ℓ ∈ G_m} is called a centroid set.

Lemma 3.5. The size of the set Γ(Q, η, ε) is O((1/ε)^{m−1}) and the time to compute it is O(d(1/ε)^{m−1}).

3.3 Approximating an arbitrary line ℓ∗
Let P be a finite set of n points in R^d \ {o}, L be a convex cone and η be an L-projection. Assume further that ℓ∗ is any line in L and that 0 < ε < 1/2. Algorithm RotateLine accepts inputs of the form (P, L, ε, ℓ∗) and outputs a line ℓ which approximates ℓ∗ in the sense of inequality (1). The line ℓ belongs to L and is contained in a centroid set (Definition 3.4) that can be described using a small (sparse) subset of points from P, where the size of that subset depends neither on n nor on d (Theorem 3.7).

Algorithm 1 RotateLine(P, L, ε, ℓ∗)
Input: A set P = {p₁, …, p_n} in R^d; a convex cone L; ε ∈ (0, 1/2); and a line ℓ∗ ∈ L.
Output: A line ℓ ∈ L that approximates ℓ∗ in the sense of inequality (1), where ℓ ∈ Γ(Q, η, ε²/216) for some sequence Q of points from P.
1: Set q₁ to be a point q ∈ P that minimizes β(Sp(q), ℓ∗).
2: Set ℓ₁ := Sp(q₁).
3: Return Improve(P, L, ε, ℓ∗, ℓ₁).

Algorithm 2 Improve(P, L, ε, ℓ∗, ℓᵢ)
1: Set ℓ := η(ℓᵢ).
2: if cost(P, ℓ) ≤ (1 + ε) cost(P, ℓ∗) then
3:   Return ℓ.
4: end if
5: Compute a point q_{i+1} ∈ P such that dist(q_{i+1}, ℓ) > (1 + ε/3) dist(q_{i+1}, ℓ∗).
6: Compute a line ℓ_{i+1} ∈ G(ℓ, Sp(q_{i+1}), ε²/216) such that β(ℓ_{i+1}, ℓ∗) ≤ (1 − ε/12) β(ℓ, ℓ∗).
7: Return Improve(P, L, ε, ℓ∗, ℓ_{i+1}).
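For the non-negative orthant L = L₀ (the NMF case), an L-projection in the sense of Lemma 3.2 can indeed be computed in O(d) time. The following Matlab sketch is our own derivation, not code from the paper: among all non-negative unit vectors, the one with the largest cosine to a direction v is the normalized positive part of v, so the projection of the line Sp(v) takes the better of the two directions ±v.

% eta for L = L_0: return a unit vector spanning the line of L_0 that is
% closest in angle to the line spanned by v (sketch; assumes v is nonzero).
function u = project_line_onto_L0(v)
    up = max(v, 0);                 % positive part of v
    un = max(-v, 0);                % positive part of -v (the other direction of the line)
    if norm(up) >= norm(un)         % larger norm of the positive part = larger |cosine|
        u = up / norm(up);
    else
        u = un / norm(un);
    end
end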
Algorithm RotateLine calls the recursive sub-algorithm Improve. The computations in Steps 5 and 6 of Algorithm Improve are well defined. Indeed, if we reach Step 5, then cost(P, ℓ) > (1 + ε) cost(P, ℓ∗). Therefore, there exists at least one point q_{i+1} ∈ P for which (dist(q_{i+1}, ℓ))² > (1 + ε)(dist(q_{i+1}, ℓ∗))². Hence, dist(q_{i+1}, ℓ) > (1 + ε)^{1/2} dist(q_{i+1}, ℓ∗) > (1 + ε/3) dist(q_{i+1}, ℓ∗), where the latter inequality follows from our assumption that ε < 1/2. As for Step 6, a line ℓ_{i+1} as sought there is guaranteed to exist in view of the next lemma.

Lemma 3.6. Let ℓ, ℓ′ be two lines in R^d. Let 0 < ε ≤ 1 and q ∈ R^d be such that

dist(q, ℓ) > (1 + ε) dist(q, ℓ′).   (2)

Then there exists a line ℓ″ ∈ G(ℓ, Sp(q), ε²/24) such that β(ℓ″, ℓ′) ≤ (1 − ε/4) β(ℓ, ℓ′).

Theorem 3.7. Assume that Algorithm RotateLine is applied on the input (P, L, ε, ℓ∗). Then:
(i) The algorithm stops after at most m(ε) = ⌈log(3/ε)/log(1/(1 − ε/12))⌉ + 1 = O((1/ε) log(1/ε)) recursive calls, and outputs a line ℓ ∈ L.
(ii) The overall runtime of the algorithm is O((nd/ε) log(1/ε)) + O((d/ε³) log(1/ε)).
(iii) cost(P, ℓ) ≤ (1 + ε) cost(P, ℓ∗).
(iv) There exists a sequence Q of at most m = m(ε) points from P such that ℓ ∈ Γ(Q, η, ε²/216).

Proof of (i). Let us denote α(ℓ′) := β(ℓ′, ℓ∗) for every line ℓ′ in R^d. Fix i, 1 ≤ i ≤ m, and consider the value of ℓ during the ith recursive call of the algorithm. As implied by Step 1 of Improve, we have α(ℓ) = α(η(ℓᵢ)). Therefore, by the properties of η (see Definition 3.1), we have α(ℓ) ≤ α(ℓᵢ). By Step 6 of Improve we have α(ℓ_{i+1}) ≤ (1 − ε/12) α(ℓ). Combining the last two inequalities yields

α(ℓ_{i+1}) ≤ (1 − ε/12) α(ℓᵢ).   (3)

Hence, by Eq. (3),

α(ℓ_m) ≤ (1 − ε/12)^{m−1} α(ℓ₁) ≤ (ε/3) α(ℓ₁).   (4)

We shall now prove that for every fixed p ∈ P, dist(p, ℓ_m) ≤ (1 + ε/3) dist(p, ℓ∗). By the triangle inequality,

dist(p, ℓ_m) ≤ dist(p, ℓ∗) + dist(proj(p, ℓ∗), ℓ_m) = dist(p, ℓ∗) + ‖proj(p, ℓ∗)‖ α(ℓ_m) ≤ dist(p, ℓ∗) + ‖p‖ α(ℓ_m).   (5)

By the definition of ℓ₁ in Algorithm RotateLine, by the properties of η as an L-projection, and by the choice of q₁, we have

α(ℓ₁) = α(η(Sp(q₁))) ≤ α(Sp(q₁)) ≤ α(Sp(p)).   (6)

By (4) and (6), we infer that α(ℓ_m) ≤ ε α(Sp(p))/3. Using the last inequality with (5) yields

dist(p, ℓ_m) ≤ dist(p, ℓ∗) + ε ‖p‖ α(Sp(p))/3 = (1 + ε/3) dist(p, ℓ∗).   (7)

Squaring inequality (7) and summing over p ∈ P yields

cost(P, ℓ_m) = Σ_{p∈P} (dist(p, ℓ_m))² ≤ (1 + ε/3)² Σ_{p∈P} (dist(p, ℓ∗))² < (1 + ε) cost(P, ℓ∗),

where the last inequality holds for all ε < 1/2. Hence, if Algorithm Improve did not stop before the mth recursive call, it would stop then. That concludes the proof of the first claim of the theorem.

3.4 Approximating ℓ∗ = ℓ∗_L(P)
In Section 3.3 we showed that every given line ℓ∗ has an approximating line ℓ that can be described by a sparse subset of P. The construction of that sparse subset and of ℓ uses ℓ∗ as an input. However, we wish to approximate the line ℓ∗ = ℓ∗_L(P), which is not known upfront. The claims of Theorem 3.7 enable us to overcome this problem. That theorem guarantees that ℓ∗ has an approximating line ℓ in Γ(Q, η, ε²/216), where Q is some sequence of at most m (Eq. (3)) points from P and η is any L-projection. Hence, we infer that ℓ can be found in the set of lines Γ̂(P, η, ε), which is defined as follows.

Definition 3.8. Let P be a set of points in R^d \ {o}, L be a convex cone, η be an L-projection, and ε > 0. Then Γ̂(P, η, ε) := ∪_Q Γ(Q, η, ε²/216), where Γ is as defined in Definition 3.4, and the union is taken over all sequences Q = (q₁, …, q_m) of at most m = m(ε) (see Eq. (3)) points from P.

Theorem 3.9. There exists a PTAS for ℓ∗ = ℓ∗_L(P).
Specifically, a line that approximates ℓ∗_L(P) in the sense of inequality (1) can be found in time O(d · poly(n)) for any constant ε > 0.

4. CORESET CONSTRUCTION
Algorithm 3, which we present and analyze below, describes how to construct a coreset C for the input set of points P and parameters ε ∈ (0, 1/2) and δ ∈ (0, 1]. A generalized version for distances to the power of z ≥ 1 and for k ≥ 1 is described in [24]. However, unlike [24], the size of C is independent of d. To that end, the algorithm first constructs a pair (S, w), where S is a multiset of points from P (namely, it is a subset of P in which points may be repeated), and w is a weight function on S (Steps 1-12). Then, it converts the pair (S, w) into a set of points C in R^d (Step 13). That set is called the coreset. The size of C is bounded by the size of S, which is set in Step 7. That size depends on some sufficiently large constant, c₀, that can be determined from the analysis in this section and in the previous section. Our claims regarding the algorithm and the properties of the coreset C are summarized in Theorem 4.1.
Algorithm 3 Coreset(P, ε; δ)
Input: P = {p₁, …, p_n} ⊆ R^d, ε ∈ (0, 1/2), δ ∈ (0, 1].
Output: A set C of points in R^d.
1: Use the SVD to compute a line ℓ∗ that minimizes cost(P, ℓ) := Σ_{p∈P} (dist(p, ℓ))² over all lines ℓ in R^d passing through the origin o.
2: For every p ∈ P: set p′ ∈ R^d to be the projection of p onto ℓ∗.
3: Set µ := (1/n) Σ_{p∈P} p′.
4: For every p ∈ P set m_p := 2(dist(p, ℓ∗))² / Σ_{q∈P} (dist(q, ℓ∗))² + 4‖p′ − µ‖² / Σ_{q∈P} ‖q′ − µ‖² + 2/n.
5: For every p ∈ P set Pr(p) := m_p / Σ_{q∈P} m_q.
6: Set S := ∅.
7: Set t := c₀ ε⁻³ log²(1/ε) log(1/δ), where c₀ is a sufficiently large constant (that can be determined from the proof of Theorem 4.1).
8: Repeat t times:
9:   Select a point p ∈ P according to the probability distribution Pr(·).
10:  S := S ∪ {p}.
11:  Set w(p) := 1/(t · Pr(p)).
12: End Repeat
13: C := { √(j(p) · w(p)) · p : p ∈ S }, where j(p) is the number of times that p appears in S.
14: Return C.

Theorem 4.1. Let P be a set of n points in R^d, ε ∈ (0, 1/2), δ ∈ (0, 1], L be a convex cone, and η be an L-projection. Let C be the set that Algorithm 3 outputs when applied on the input (P, ε; δ). Then the following hold:
(i) With probability at least 1 − δ, every line ℓ ∈ {ℓ∗_L(P)} ∪ Γ̂(C, η, ε) (Definition 3.8) satisfies

(1 − ε) cost(P, ℓ) ≤ cost(C, ℓ) ≤ (1 + ε) cost(P, ℓ),   (8)

where cost(P, ℓ) = Σ_{p∈P} (dist(p, ℓ))².
(ii) |C| = O((1/ε³) log²(1/ε) log(1/δ)).
(iii) The runtime of Coreset is O(nd²) + O((1/ε³) log²(1/ε) log(1/δ) log n).

Proof of (i) (sketch). The proof is based on [6, Theorem 4.1], which shows how to construct a set of size

O((v + log(1/δ)) t²/ε²),   (9)

which is an ε-coreset with probability at least 1 − δ. Here, v is the pseudo-dimension of the corresponding query space and t is its total sensitivity; see more details in Section 2 and full details in [6]. The total sensitivity t for the family of all possible k-dimensional subspaces of R^d, and for distances to the power of z, is t = O(k^{O(z)}) for z ≥ 1 [24, Theorem 19]. The pseudo-dimension of this family is v = O(d), and coresets of size linear in d that approximate all possible subspaces of R^d can be constructed by substituting v = O(d) and the value of t above in (9). The coreset construction of [24] is essentially the same as in Algorithm 3 (for the case z = 2), except for the smaller size of the coreset that is output by our algorithm, which is independent of d. This is because our coreset aims to approximate only a constrained (smaller) subset of the family of lines, namely the lines in the centroid set Γ(Q, η, ε) (Definition 3.4). This subset has a pseudo-dimension v that depends on ε but not on d, as follows from Lemma 3.5. By Eq. (3), we conclude that the total sensitivity is t = O((1/ε) log(1/ε) log(1/δ)).

Applying exhaustive search on the small coreset yields a polynomial time approximation scheme:

Theorem 4.2. Let ℓ∗ = ℓ∗_L(P) be the line that minimizes the sum of squared distances to the points of P over every line ℓ in a convex cone L. Then, with probability at least 9/10, a line ℓ such that

cost(P, ℓ) ≤ (1 + ε) cost(P, ℓ∗),   (10)

can be computed in time O(nd²) + O(d · 2^{1/ε^{O(1)}}).

More generally, instead of applying our PTAS, we can use any existing heuristic that aims to compute the optimal line ℓ∗. In our experimental results we show that this approach can work even without a provable guarantee for the heuristic, given the proof above for the reduction to coresets. If the chosen heuristic has an approximation guarantee, then we get similar guarantees on its outcome when applied on the coreset. This is stated in the following theorem.

Theorem 4.3. Let C := Coreset(P, ε; δ) and let ℓ′ be a line that approximates the optimal solution of the coreset C, i.e., cost(C, ℓ′) ≤ (1 + ε) cost(C, ℓ∗_L(C)). Then, with probability at least 1 − δ, cost(P, ℓ′) ≤ (1 + ε) cost(P, ℓ∗_L(P)).
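The following Matlab sketch mirrors Algorithm 3 for z = 2 and k = 1. It relies on our reconstruction of the garbled formulas above (the scores m_p, the weights 1/(t·Pr(p)), the square-root scaling in Step 13) and on an arbitrary value for the constant c₀, so it should be read as an illustration rather than as the authors' implementation; edge cases (e.g., points lying exactly on a line) are ignored.

function C = coreset_rows(P, epsilon, delta)
    % Sketch of Algorithm 3 (z = 2, lines through the origin).
    n  = size(P, 1);
    c0 = 10;                                     % "sufficiently large" constant (assumed value)
    [~, ~, V] = svds(P, 1);                      % Step 1: optimal unconstrained line via SVD
    v  = V(:, 1);
    Pp = (P * v) * v';                           % Step 2: projections p' onto the line
    d2 = sum((P - Pp).^2, 2);                    % squared distances to the line
    mu = mean(Pp, 1);                            % Step 3: mean of the projections
    e2 = sum(bsxfun(@minus, Pp, mu).^2, 2);      % ||p' - mu||^2
    m  = 2*d2/sum(d2) + 4*e2/sum(e2) + 2/n;      % Step 4: scores m_p (reconstructed formula)
    pr = m / sum(m);                             % Step 5: sampling distribution
    t  = ceil(c0 * log(1/epsilon)^2 * log(1/delta) / epsilon^3);   % Step 7: sample size
    idx = randsample(n, t, true, pr);            % Steps 8-10: t i.i.d. draws
    w   = 1 ./ (t * pr(idx));                    % Step 11: inverse-probability weights
    % Step 13: duplicate draws are kept as separate rows, which plays the role of j(p);
    % the square root makes unweighted squared distances of C equal the weighted ones.
    C = bsxfun(@times, sqrt(w), P(idx, :));
end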
5. EXPERIMENTAL RESULTS
Hardware. We used the Amazon EC2 AWS cloud service with 64 machines (workers) and its S3 storage support. The chosen machine type was Standard (cc2.8xlarge, 64 cores). The service was run from a popular laptop (Lenovo W520; CPU: Intel Core i7-3940XM; memory: 32.0GB DDR3-1600MHz SDRAM; hard drive: 256GB 2.5" SATA3 solid state drive; Intel vPro).

Software. All the code was written in Matlab 2013b (64-bit), with the support of the Matlab GPU toolbox (for most of the basic linear algebra functions). As for NMF heuristics, we use the nnmf function of Matlab's Statistics Toolbox. This function supports mainly two heuristics: als (the default), which uses an alternating least-squares algorithm, and mult, which uses a multiplicative update algorithm. The other parameters of this function were assigned their default values. The function returns a matrix H whose columns span a subspace of dimension at most k.

Data. We used a public dataset from the UCI repository, called the YouTube Multiview Video Games Dataset [22]. This dataset contains instances described by 91 feature types, with class information for exploring multiview topics. The number of records (rows) is n ≈ 10⁶ and the number of features (columns, dimension) is d = 91.

Experiment. The rows of the original dataset were partitioned into batches on the local hard drive of our laptop computer. Then, an empty coreset tree was initialized in each worker on the cloud; see Fig. 3(a). The streaming was implemented by sending the batches one by one to the next worker on the cloud. This is done using a parallel execution of the command "send the ith batch to the ith worker", for i = 1, …, 64, which is implemented in Matlab using the command spmd. Each of the 64 workers then computes its coreset, and the spmd command returns an array of 64 coresets. The main process then recomputes a coreset for these 64 coresets locally on the laptop, and runs Matlab's nnmf locally on the small coreset.
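Schematically, the distributed run looks as follows in Matlab (a sketch under the assumption that each worker already holds its stream batches in a local matrix localRows, and reusing the coreset_rows sketch above; spmd requires the Parallel Computing Toolbox):

spmd
    Ci = coreset_rows(localRows, epsilon, delta);  % each of the 64 workers compresses its data
end
vals = cell(1, numel(Ci));                         % gather the per-worker coresets on the client
for i = 1:numel(Ci)
    vals{i} = Ci{i};
end
allC = vertcat(vals{:});
C = coreset_rows(allC, epsilon, delta);            % merge-and-reduce: coreset of the coresets
[W, H] = nnmf(C, k);                               % off-line NMF on the final small coreset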
Evaluation. For comparison, we applied the experiment above also using a uniform random sample T instead of our coreset S. In addition, we ran the nnmf function on the full original matrix A using a single computer. We then computed the sum of squared distances from the rows of A to the subspaces that are spanned by the corresponding output matrices H_T, H_S and H_A, after applying the appropriate rotation. These sums are denoted by cost(A, H_T), cost(A, H_S) and cost(A, H_A), respectively. The empirical approximation error of the resulting coreset S is then ε := cost(A, H_S)/cost(A, H_A) − 1. Hence, cost(A, H_S) = (1 + ε) cost(A, H_A). Note that, since nnmf implements heuristics, the value of ε might be negative, which is indeed the case in Fig. 5(a) for the coresets. In this case the cost obtained by the coreset on the rows of A is actually smaller than the cost of a run on the full matrix A. Interestingly enough, we never got such a result by using uniform sampling.

We repeated the experiments for (i) uniform random sampling (with replacement) instead of coresets, (ii) different values of k, (iii) the different heuristics implemented in nnmf: als (the default), which uses an alternating least-squares algorithm, and mult, which uses a multiplicative update algorithm, and (iv) synthetic input data (standard Gaussian noise).

Results on synthetic data. We show results for one of the coreset constructions on one arbitrary worker. As expected, the properties of the coresets look essentially the same for the other workers and the resulting set. In Fig. 3(b) we show how the memory (RAM) used by the worker increases when we add more points to its coreset from the synthetic data. These results are common to other coreset papers. The zig-zag corresponds to the number of coresets in the tree that each new point needs to update. For example, the first point in a streaming tree is updated in O(1) time; however, the 2^i-th point, for some i ≥ 1, climbs up through O(log i) levels in the tree, so O(log i) coresets need to be merged. The error decreases roughly polynomially in 1/|S| for the synthetic data, as shown in Fig. 5(b).

Results on the real dataset. Fig. 4 shows results for a k = 1 dimensional subspace (number of columns of H). Note that (i) for small sample sizes the coreset error already converges to zero, while the error of the random sample is still very high and even increases with the sample size for such values; (ii) the variance of the error seems to be much smaller for the coreset compared to the uniform sample; (iii) for large sample sizes (≥ 2500), the coreset error is essentially zero and the ratio compared to the uniform sample seems to be very high.

Fig. 5(a) shows results for k = 10. Similar results were obtained for other values of k. The results on the coreset are almost always better than those on the uniform sample, for both of the nnmf algorithms. As noted above, for the case k = 10, the results on the coresets are sometimes better than the result on the original data (ε < 0). While we do not have a formal explanation, this is a common phenomenon when running heuristics on coresets, although it cannot happen when computing optimal solutions. It seems that the heuristics compute local extreme points and sometimes converge to such bad pitfalls on the original data, while the coreset compresses the data and cleans or smooths out many of these bad local extreme points.
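For reference, the empirical error above can be computed along the following lines (our sketch; it assumes, as stated above, that the columns of the returned H span the candidate subspace, so orth(H) gives an orthonormal basis of it):

cost2 = @(A, H) norm(A - (A * orth(H)) * orth(H)', 'fro')^2;  % sum of squared distances of the rows of A
eps_emp = cost2(A, H_S) / cost2(A, H_A) - 1;                  % empirical approximation error of the coreset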
6. CONCLUSION
We presented in this study a scalable and provable algorithm for general problems in constrained low rank approximation. Our technique is based on two cornerstones: a constructive proof that an approximation to the constrained 1-subspace is spanned by few input points and constraints; and a method to extract from the Big Data input a small coreset on which we may apply this construction, in lieu of the original large input. We extended the scope of previous art by considering general constraints and general metrics. Our algorithms significantly boost existing techniques with respect to runtime and applicability to streaming data, and by being embarrassingly parallelizable. Our experimental results indicate that coresets constitute a theoretical as well as a practical bridge that enables running traditional small-data algorithms (which are typically off-line, non-parallel and sometimes inefficient) on modern Big Data.

The provable guarantee which we offer is only with respect to the case where the approximating subspace is 1-dimensional. The extension of this algorithm to higher dimensions combines heuristic ingredients. An important goal for future research is to devise a PTAS, or any other algorithm with a provable guarantee, also for the higher dimensional problems. Another target of future research is to demonstrate the benefits of our proposed techniques in various applications and settings.

7. REFERENCES
[1] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization - provably. In STOC.
[2] S. Bergmann, J. Ihmels, and N. Barkai. Iterative signature algorithm for the analysis of large-scale gene expression data. Physical Review E, 67:031902.
[3] C. Boutsidis and P. Drineas. Random projections for the nonnegative least-squares problem. Linear Algebra and its Applications, 431.
[4] K. L. Clarkson and D. Woodruff. Low rank approximation and regression in input sparsity time. In STOC.
[5] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for l2 regression and applications. In SODA.
[6] D. Feldman and M. Langberg. A unified framework for approximating and clustering data. In STOC.
[7] D. Feldman, M. Monemizadeh, and C. Sohler. A PTAS for k-means clustering based on weak coresets. In SoCG.
[8] D. Feldman, M. Monemizadeh, C. Sohler, and D. Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In SODA.
[9] D. Feldman, M. Monemizadeh, C. Sohler, and D. P. Woodruff. Coresets and sketches for high dimensional subspace approximation problems. In SODA.
[10] D. Feldman, M. Schmidt, and C. Sohler. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In SODA.
[11] D. Feldman, M. Volkov, and D. Rus. Dimensionality reduction of massive sparse datasets using coresets. Technical report, 2015.
[12] Dan Feldman. NNMF coreset code in Matlab.
[13] M. Ghashami, E. Liberty, J. M. Phillips, and D. Woodruff. Frequent directions: Simple and deterministic matrix sketching. arXiv preprint.
[14] S. Har-Peled and S. Mazumdar. On coresets for k-means and k-median clustering. In STOC.
[15] Thomas Hofmann. Probabilistic latent semantic analysis. In UAI.
[16] S. Hyvönen, P. Miettinen, and E. Terzi. Interpretable nonnegative matrix decompositions. In KDD.
[17] H. Kim and H. Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23.
[18] J. Kim and H. Park. Toward faster nonnegative matrix factorization: A new algorithm and comparisons. In ICDM.
[19] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401.
[20] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS.
[21] E. Liberty. Simple and deterministic matrix sketching. In KDD.
[22] Omid Madani, Manfred Georg, and David A. Ross. On using nearly-independent feature families for high precision and confidence. Machine Learning, 92.
[23] N. Shyamalkumar and K. Varadarajan. Efficient subspace approximation algorithms. In SODA.
[24] Kasturi Varadarajan and Xin Xiao. On the sensitivity of shape fitting problems. arXiv preprint.
[25] Stephen A. Vavasis. On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization, 20.
[26] Wei Xu, Xin Liu, and Yihong Gong. Document clustering based on non-negative matrix factorization. In SIGIR, 2003.
Figure 3: (a) Coresets tree. The input stream of points at the bottom is partitioned into odd and even subsets (green and pink). The odd subsets are sent to the left worker and the even subsets to the right worker. Each worker maintains a merge-and-reduce tree of size roughly logarithmic in its input (blue frames). Each worker computes a single coreset for its tree and sends it to a central machine. The union of these two coresets forms a coreset for all the input points seen so far in the stream. (b) Memory used. The memory used to store the coresets tree by each worker is only logarithmic in the input size n, since at most one coreset is stored in each of the log n levels.

Figure 4: Approximation error of the input matrix A using the output subspace obtained from running Matlab's nnmf(S, k, 'Algorithm', Alg) function. Here S is either the coreset or a uniform random sample, k = 1, and Alg is the chosen Matlab heuristic: als (left panel) or mult (right panel). Both panels plot the error (epsilon) against the sample size |S| for the coreset and for uniform sampling.

Figure 5: (a) Same as Figure 4, using k = 10 in both the coreset construction and the call to nnmf (curves: coreset and uniform sample, each with the als and mult heuristics). (b) Coreset empirical error on a fixed synthetic random Gaussian dataset for different coreset sizes.
Server Load Prediction Suthee Chaidaroon ([email protected]) Joon Yeong Kim ([email protected]) Jonghan Seo ([email protected]) Abstract Estimating server load average is one of the methods that
Lecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
1 Solving LPs: The Simplex Algorithm of George Dantzig
Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.
Learning, Sparsity and Big Data
Learning, Sparsity and Big Data M. Magdon-Ismail (Joint Work) January 22, 2014. Out-of-Sample is What Counts NO YES A pattern exists We don t know it We have data to learn it Tested on new cases? Teaching
Orthogonal Projections
Orthogonal Projections and Reflections (with exercises) by D. Klain Version.. Corrections and comments are welcome! Orthogonal Projections Let X,..., X k be a family of linearly independent (column) vectors
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Sketch As a Tool for Numerical Linear Algebra
Sketching as a Tool for Numerical Linear Algebra (Graph Sparsification) David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania April, 2015 Sepehr Assadi (Penn)
Private Approximation of Clustering and Vertex Cover
Private Approximation of Clustering and Vertex Cover Amos Beimel, Renen Hallak, and Kobbi Nissim Department of Computer Science, Ben-Gurion University of the Negev Abstract. Private approximation of search
Approximation Algorithms
Approximation Algorithms or: How I Learned to Stop Worrying and Deal with NP-Completeness Ong Jit Sheng, Jonathan (A0073924B) March, 2012 Overview Key Results (I) General techniques: Greedy algorithms
Load balancing of temporary tasks in the l p norm
Load balancing of temporary tasks in the l p norm Yossi Azar a,1, Amir Epstein a,2, Leah Epstein b,3 a School of Computer Science, Tel Aviv University, Tel Aviv, Israel. b School of Computer Science, The
2.1 Complexity Classes
15-859(M): Randomized Algorithms Lecturer: Shuchi Chawla Topic: Complexity classes, Identity checking Date: September 15, 2004 Scribe: Andrew Gilpin 2.1 Complexity Classes In this lecture we will look
CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma
CMSC 858T: Randomized Algorithms Spring 2003 Handout 8: The Local Lemma Please Note: The references at the end are given for extra reading if you are interested in exploring these ideas further. You are
Lecture 6 Online and streaming algorithms for clustering
CSE 291: Unsupervised learning Spring 2008 Lecture 6 Online and streaming algorithms for clustering 6.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
1 Sets and Set Notation.
LINEAR ALGEBRA MATH 27.6 SPRING 23 (COHEN) LECTURE NOTES Sets and Set Notation. Definition (Naive Definition of a Set). A set is any collection of objects, called the elements of that set. We will most
Categorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
Sparse Nonnegative Matrix Factorization for Clustering
Sparse Nonnegative Matrix Factorization for Clustering Jingu Kim and Haesun Park College of Computing Georgia Institute of Technology 266 Ferst Drive, Atlanta, GA 30332, USA {jingu, hpark}@cc.gatech.edu
Chapter 6: Episode discovery process
Chapter 6: Episode discovery process Algorithmic Methods of Data Mining, Fall 2005, Chapter 6: Episode discovery process 1 6. Episode discovery process The knowledge discovery process KDD process of analyzing
Linear Programming. March 14, 2014
Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1
Standardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
D-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
Linear Programming. Widget Factory Example. Linear Programming: Standard Form. Widget Factory Example: Continued.
Linear Programming Widget Factory Example Learning Goals. Introduce Linear Programming Problems. Widget Example, Graphical Solution. Basic Theory:, Vertices, Existence of Solutions. Equivalent formulations.
The Online Set Cover Problem
The Online Set Cover Problem Noga Alon Baruch Awerbuch Yossi Azar Niv Buchbinder Joseph Seffi Naor ABSTRACT Let X = {, 2,..., n} be a ground set of n elements, and let S be a family of subsets of X, S
Lecture 4 Online and streaming algorithms for clustering
CSE 291: Geometric algorithms Spring 2013 Lecture 4 Online and streaming algorithms for clustering 4.1 On-line k-clustering To the extent that clustering takes place in the brain, it happens in an on-line
ISOMETRIES OF R n KEITH CONRAD
ISOMETRIES OF R n KEITH CONRAD 1. Introduction An isometry of R n is a function h: R n R n that preserves the distance between vectors: h(v) h(w) = v w for all v and w in R n, where (x 1,..., x n ) = x
Going Big in Data Dimensionality:
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
Randomized Robust Linear Regression for big data applications
Randomized Robust Linear Regression for big data applications Yannis Kopsinis 1 Dept. of Informatics & Telecommunications, UoA Thursday, Apr 16, 2015 In collaboration with S. Chouvardas, Harris Georgiou,
Solving polynomial least squares problems via semidefinite programming relaxations
Solving polynomial least squares problems via semidefinite programming relaxations Sunyoung Kim and Masakazu Kojima August 2007, revised in November, 2007 Abstract. A polynomial optimization problem whose
Binary Image Reconstruction
A network flow algorithm for reconstructing binary images from discrete X-rays Kees Joost Batenburg Leiden University and CWI, The Netherlands [email protected] Abstract We present a new algorithm
We shall turn our attention to solving linear systems of equations. Ax = b
59 Linear Algebra We shall turn our attention to solving linear systems of equations Ax = b where A R m n, x R n, and b R m. We already saw examples of methods that required the solution of a linear system
Exponential time algorithms for graph coloring
Exponential time algorithms for graph coloring Uriel Feige Lecture notes, March 14, 2011 1 Introduction Let [n] denote the set {1,..., k}. A k-labeling of vertices of a graph G(V, E) is a function V [k].
24. The Branch and Bound Method
24. The Branch and Bound Method It has serious practical consequences if it is known that a combinatorial problem is NP-complete. Then one can conclude according to the present state of science that no
K-Means Clustering Tutorial
K-Means Clustering Tutorial By Kardi Teknomo,PhD Preferable reference for this tutorial is Teknomo, Kardi. K-Means Clustering Tutorials. http:\\people.revoledu.com\kardi\ tutorial\kmean\ Last Update: July
General Framework for an Iterative Solution of Ax b. Jacobi s Method
2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,
ON THE COMPLEXITY OF THE GAME OF SET. {kamalika,pbg,dratajcz,hoeteck}@cs.berkeley.edu
ON THE COMPLEXITY OF THE GAME OF SET KAMALIKA CHAUDHURI, BRIGHTEN GODFREY, DAVID RATAJCZAK, AND HOETECK WEE {kamalika,pbg,dratajcz,hoeteck}@cs.berkeley.edu ABSTRACT. Set R is a card game played with a
Ideal Class Group and Units
Chapter 4 Ideal Class Group and Units We are now interested in understanding two aspects of ring of integers of number fields: how principal they are (that is, what is the proportion of principal ideals
Continued Fractions and the Euclidean Algorithm
Continued Fractions and the Euclidean Algorithm Lecture notes prepared for MATH 326, Spring 997 Department of Mathematics and Statistics University at Albany William F Hammond Table of Contents Introduction
Algorithmic Techniques for Big Data Analysis. Barna Saha AT&T Lab-Research
Algorithmic Techniques for Big Data Analysis Barna Saha AT&T Lab-Research Challenges of Big Data VOLUME Large amount of data VELOCITY Needs to be analyzed quickly VARIETY Different types of structured
Section 1.1. Introduction to R n
The Calculus of Functions of Several Variables Section. Introduction to R n Calculus is the study of functional relationships and how related quantities change with each other. In your first exposure to
Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca
Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?
How To Prove The Dirichlet Unit Theorem
Chapter 6 The Dirichlet Unit Theorem As usual, we will be working in the ring B of algebraic integers of a number field L. Two factorizations of an element of B are regarded as essentially the same if
Numerical Analysis Lecture Notes
Numerical Analysis Lecture Notes Peter J. Olver 5. Inner Products and Norms The norm of a vector is a measure of its size. Besides the familiar Euclidean norm based on the dot product, there are a number
Advanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS
NEW YORK STATE TEACHER CERTIFICATION EXAMINATIONS TEST DESIGN AND FRAMEWORK September 2014 Authorized for Distribution by the New York State Education Department This test design and framework document
BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION
BIG DATA PROBLEMS AND LARGE-SCALE OPTIMIZATION: A DISTRIBUTED ALGORITHM FOR MATRIX FACTORIZATION Ş. İlker Birbil Sabancı University Ali Taylan Cemgil 1, Hazal Koptagel 1, Figen Öztoprak 2, Umut Şimşekli
Faster deterministic integer factorisation
David Harvey (joint work with Edgar Costa, NYU) University of New South Wales 25th October 2011 The obvious mathematical breakthrough would be the development of an easy way to factor large prime numbers
Master s Theory Exam Spring 2006
Spring 2006 This exam contains 7 questions. You should attempt them all. Each question is divided into parts to help lead you through the material. You should attempt to complete as much of each problem
THREE DIMENSIONAL GEOMETRY
Chapter 8 THREE DIMENSIONAL GEOMETRY 8.1 Introduction In this chapter we present a vector algebra approach to three dimensional geometry. The aim is to present standard properties of lines and planes,
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,
The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints
