Learning, Sparsity and Big Data
M. Magdon-Ismail (joint work)
January 22, 2014
Out-of-Sample is What Counts

When can we learn? (YES/NO checklist)
- A pattern exists.
- We don't know it.
- We have data to learn it.
Performance is tested on new cases (out-of-sample).

Teaching Machine Learning to a Diverse Audience: the Foundation-based Approach. Lin, M-I, Abu-Mostafa, ICML 2012.

© Creator: Malik Magdon-Ismail
DMOZ Data and SVMs

DMOZ is a good way to generate human-labeled text data.
Task: US:Pennsylvania:Business&Economy vs. US:Arkansas:Localities:M

Bag of (18K) words and SVMs:

          300 important words | All words
  Error   24.7%               | 27.6%

important words: pennsylvania, arkansas, business, baptist, company, ...

Other kinds of tasks: polarity, relevance, interest, topic clustering/classification.
Financial Market Price Data

50 stocks in the S&P 100, 2000-2011. 15 stocks reconstruct the price changes (30% error):

  AA: Alcoa               AMGN: Amgen                     AVP: Avon
  BHI: Baker Hughes       WMB: Williams Companies         DIS: Disney
  C: Citigroup            AEP: American Electric Power    F: Ford
  GD: General Dynamics    INTC: Intel                     IBM: IBM
  MCD: McDonald's         TXN: Texas Instruments          XRX: Xerox

Optimal, 15 d.o.f. (PCA): 24% error.
Throwing Out Unnecessary Features is Good

Sparsity: represent your solution using only a few features.
- Sparse solutions generalize better out-of-sample (less overfitting).
- Sparse solutions are easier to interpret (few important features).
- Computations are more efficient.

Problem (NP-hard): find the few relevant features quickly.
Question: are those features provably good?
Three Learning Problems

1. Data Summarization/Compression (PCA)
2. Categorization/Clustering (k-means)
3. Prediction (Regression)
Data

Data Matrix X ∈ R^{n×d}: n data points, d dimensions.

  name   age     debt    income   hair     weight   sex
  John   21yrs   $10K    $65K     black    175lbs   M
  Joe    74yrs   $100K   $25K     blonde   275lbs   M
  Jane   27yrs   $20K    $85K     blonde   135lbs   F
  ...
  Jen    37yrs   $400K   $105K    brun     155lbs   F

Response Matrix Y ∈ R^{n×ω}: credit? / limit / risk (e.g. 2K high; 0 10K low; ...; 15K high).
More Beautiful Data

X ∈ R^{231×174}, Y ∈ R^{231×166}.

[Figure: % reconstruction error (0-15%) vs. number of principal components (0-60), one plot for X and one for Y.]
PCA, K-means, Linear Regression

[Figure: PCA, K-means, and Regression on the image data, with k = 20 and r = 2k.]
Sparsity

Represent your solution using only a few...

Example: linear regression, Xw = y.
y is an optimal linear combination of only a few columns of X.
(sparse regression; regularization (||w||_0 ≤ k); feature subset selection; ...)
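The slide names the problem but not an algorithm; here is a minimal sketch of one standard heuristic for the ||w||_0 ≤ k constraint, greedy forward selection. The function name and toy data are illustrative, not the talk's method.

```python
import numpy as np

def greedy_sparse_regression(X, y, k):
    """Greedy forward selection: pick k columns of X that best fit y.

    A simple heuristic for the (NP-hard) sparse regression problem:
    minimize ||Xw - y|| subject to at most k nonzeros in w.
    """
    selected = []
    residual = y.copy()
    for _ in range(k):
        # pick the column most correlated with the current residual
        scores = np.abs(X.T @ residual)
        scores[selected] = -np.inf          # don't re-pick a column
        selected.append(int(np.argmax(scores)))
        # refit least squares on the selected columns
        w, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        residual = y - X[:, selected] @ w
    return selected, w

# toy example: y depends on only 2 of 10 features
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
y = 3.0 * X[:, 2] - 2.0 * X[:, 7]
cols, w = greedy_sparse_regression(X, y, k=2)
```

On this noiseless toy problem, the greedy pass recovers exactly the two relevant columns.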
Singular Value Decomposition (SVD)

X = U Σ V^t,  with U ∈ R^{n×d}, Σ ∈ R^{d×d}, V^t ∈ R^{d×d}; computed in O(nd^2).

Splitting at rank k, Σ = diag(Σ_k, Σ_{d−k}) and V = [V_k, V_{d−k}]:

    X_k = U_k Σ_k V_k^t = X V_k V_k^t.

X_k is the best rank-k approximation to X: a reconstruction of X using only a few degrees of freedom.

[Figure: X and its reconstructions X_20, X_40, X_60.]

V_k is an orthonormal basis for the best k-dimensional subspace of the row space of X.
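The two identities above can be checked concretely in NumPy (illustrative random data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 30))

# full SVD: X = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 5
Vk = Vt[:k].T                          # top-k right singular vectors (d x k)
Xk_svd  = U[:, :k] * s[:k] @ Vt[:k]    # U_k Sigma_k V_k^t
Xk_proj = X @ Vk @ Vk.T                # X V_k V_k^t -- the same matrix

# Eckart-Young: the rank-k error equals the tail singular values
err  = np.linalg.norm(X - Xk_svd)      # Frobenius norm
tail = np.sqrt((s[k:] ** 2).sum())
```

Both constructions of X_k coincide, and its error matches the discarded singular values exactly.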
Fast Approximate SVD

1: Z = XR,  R ∈ N(0,1)^{d×r},  Z ∈ R^{n×r}
2: Q = qr.factorize(Z)
3: V̂_k = svd_k(Q^t X)

Theorem. Let r = k(1 + 1/ε) and E = X − X V̂_k V̂_k^t. Then

    E[‖E‖] ≤ (1 + ε) ‖X − X_k‖.

Running time is O(ndk) = o(SVD). [BDM, FOCS 2011]

We can construct V_k quickly.
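The three-step recipe is short enough to run directly; here is a sketch assuming R has i.i.d. N(0,1) entries (helper name and test matrix are mine, not from the talk):

```python
import numpy as np

def approx_top_k_subspace(X, k, eps=1.0, seed=2):
    """Randomized sketch for an approximate top-k right subspace of X.

    Steps: Z = X R (Gaussian sketch), Q = qr(Z), then the top-k right
    singular vectors of Q^t X.  Cost O(ndr) instead of O(nd^2).
    """
    d = X.shape[1]
    r = int(np.ceil(k * (1 + 1 / eps)))
    R = np.random.default_rng(seed).standard_normal((d, r))
    Z = X @ R                              # n x r sketch
    Q, _ = np.linalg.qr(Z)                 # orthonormal basis for range(Z)
    _, _, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return Vt[:k].T                        # d x k approximate V_k

# test matrix: rank-8 signal plus small noise
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 50))
X += 0.01 * rng.standard_normal((200, 50))

k = 8
Vk_hat = approx_top_k_subspace(X, k)
approx_err = np.linalg.norm(X - X @ Vk_hat @ Vk_hat.T)

_, s, _ = np.linalg.svd(X)
best_err = np.sqrt((s[k:] ** 2).sum())    # exact rank-k error
```

On this near-low-rank input the approximate subspace is essentially as good as the exact one, consistent with the (1 + ε) guarantee.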
V_k and Sparsity

Important dimensions of V_k^t are important for X.

[Figure: sampling columns s_1, ..., s_5 of V_k^t to form V̂_k^t ∈ R^{k×r}.]

The sampled r columns are good if V̂_k^t V̂_k ≈ V_k^t V_k = I.

Sampling schemes: largest norm (Jolliffe, 1972); randomized norm sampling (Rudelson, 1999; Rudelson-Vershynin, 2007); greedy (Batson et al., 2009; BDM, 2011).

V_k is important.
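The simplest of the listed schemes, Jolliffe's largest-norm selection, fits in a few lines (toy data; the heavy features are planted at indices 0 and 1 so the right answer is known):

```python
import numpy as np

def largest_norm_columns(Vkt, r):
    """Keep the r columns of V_k^t with the largest squared norms
    (the classical deterministic largest-norm scheme)."""
    norms = (Vkt ** 2).sum(axis=0)
    return np.sort(np.argsort(norms)[::-1][:r])

rng = np.random.default_rng(6)
# six features; only the first two carry signal
X = 0.01 * rng.standard_normal((100, 6))
X[:, 0] += 5.0 * rng.standard_normal(100)
X[:, 1] += 5.0 * rng.standard_normal(100)

_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vkt = Vt[:2]                       # V_k^t for k = 2
picked = largest_norm_columns(Vkt, r=2)
```

The top singular directions concentrate on the two signal features, so the largest-norm columns of V_k^t point exactly at them.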
Sparse PCA Algorithm

1: Choose a few columns C of X; C ∈ R^{n×r}.
2: Find X_{C,k}, the best rank-k approximation to X in the span of C.
3: Compute the rank-k SVD: X_{C,k} = U_{C,k} Σ_{C,k} V_{C,k}^t.
4: Z = X V_{C,k}.

Each feature in Z is a mixture of only the few original r feature dimensions in C.

    ‖X − X V_{C,k} V_{C,k}^t‖ ≤ ‖X − X_{C,k} V_{C,k} V_{C,k}^t‖ = ‖X − X_{C,k}‖ ≤ (1 + O(2k/r)) ‖X − X_k‖.   [BDM, FOCS 2011]
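A simplified sketch of the four steps: instead of the paper's exact projection in step 2, this takes the top-k right singular vectors of the selected submatrix C itself, which keeps the loadings explicitly r-sparse. The function and data are illustrative only.

```python
import numpy as np

def sparse_pca_from_columns(X, cols, k):
    """Sparse PCA features from a chosen column subset -- simplified.

    The loading vectors are zero outside `cols`, so each feature in Z
    mixes only the r selected original dimensions.
    """
    C = X[:, cols]                            # n x r submatrix
    _, _, Wt = np.linalg.svd(C, full_matrices=False)
    V = np.zeros((X.shape[1], k))
    V[cols, :] = Wt[:k].T                     # r-sparse loading vectors
    Z = X @ V                                 # k sparse-PCA features
    return Z, V

rng = np.random.default_rng(7)
X = rng.standard_normal((60, 12))
cols = [0, 3, 5, 8]                           # step 1: assumed given here
Z, V = sparse_pca_from_columns(X, cols, k=2)
```

The loadings are orthonormal and supported only on the four chosen columns, matching the sparsity claim on this slide; the theory's guarantee needs the careful column sampling of the previous slide.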
Sparse PCA

[Figure: dense PCA vs. sparse PCA (r = 2k) reconstructions for k = 20, 40, 60.]

Theorem. One can construct, in o(SVD) time, k features that are r-sparse, r = O(k), and as good as the exact dense top-k PCA features.
Clustering: K-Means

[Figure: full (slow) vs. fast, sparse clusterings with 3 and 4 clusters.]

Theorem. There is a subset of features of size O(#clusters) which produces nearly the optimal partition (within a constant factor). One can quickly produce features with a log-approximation factor. [BDM, 2013]
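A toy illustration of the theorem's spirit (a minimal Lloyd's k-means of my own, not the paper's feature-selection algorithm): when the cluster structure lives in one feature, clustering on that feature alone essentially reproduces the full partition.

```python
import numpy as np

def kmeans(X, k, init_idx, iters=100):
    """Minimal Lloyd's iterations with explicit initial centers."""
    centers = X[init_idx].astype(float).copy()
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):                     # keep old center if a cluster empties
                centers[j] = pts.mean(0)
    return labels

rng = np.random.default_rng(8)
n = 100
X = rng.standard_normal((2 * n, 20))
X[:n, 0] += 10.0                             # clusters differ only in feature 0

full   = kmeans(X, 2, init_idx=[0, n])       # all 20 features
sparse = kmeans(X[:, :1], 2, init_idx=[0, n])  # one selected feature
# partitions agree up to label swapping
agree = max((full == sparse).mean(), (full != sparse).mean())
```

With well-separated clusters the single informative feature yields (up to relabeling) the same partition as all 20 features, at a fraction of the cost.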
PCA Regression using Few Important Features

[Figure: PCA (slow, dense) vs. sparse, fast regression for k = 20, 40.]

Theorem. One can find O(k) pure features which perform as well as top-k PCA regression (additive error controlled by ‖X − X_k‖_F / σ_k). [BDM, 2013]
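For reference, the mechanics of the dense top-k PCA regression baseline that the theorem matches (synthetic data; this is not the paper's sparse algorithm):

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, k = 200, 30, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# top-k PCA regression: regress y on the k leading principal components
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Zk = X @ Vt[:k].T                              # n x k component scores
w_pca, *_ = np.linalg.lstsq(Zk, y, rcond=None)
err_pca = np.linalg.norm(Zk @ w_pca - y)

# baseline: ordinary least squares on all d features
w_all, *_ = np.linalg.lstsq(X, y, rcond=None)
err_all = np.linalg.norm(X @ w_all - y)
```

Since the component scores span a k-dimensional subspace of the column space of X, PCA regression can never beat full least squares in-sample; its value is the dimension reduction, which the sparse version achieves with O(k) pure features.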
The Proofs

All the algorithms use the sparsifier of V_k^t from [BDM, FOCS 2011]:

1. Choose columns of V_k^t that preserve its singular values.
2. Ensure that the selected columns preserve the structural properties of the objective with respect to the sampled columns of X.
3. Use dual-set sparsification algorithms to accomplish (2).
THANKS!

- Data compression (PCA): unsupervised
- k-means clustering
- PCA regression: supervised
- Support vector machines
- Coresets (subsets of data instead of features)

Few features: easy to interpret; better generalizers; faster computations.