Learning, Sparsity and Big Data




M. Magdon-Ismail (Joint Work), January 22, 2014.

Out-of-Sample is What Counts

Learning is possible when a pattern exists, we don't know it, and we have data to learn it from; what matters is how the learned model does when tested on new cases (out-of-sample).

Reference: Teaching Machine Learning to a Diverse Audience: the Foundation-based Approach, Lin, Magdon-Ismail, Abu-Mostafa, ICML 2012.

DMOZ Data and SVMs

DMOZ is a good way to generate human-labeled text data. Example task: US:Pennsylvania:Business&Economy vs. US:Arkansas:Localities:M.

Bag of (18K) words and SVMs:

            300 important words   All words
    Error   24.7%                 27.6%

Important words: pennsylvania, arkansas, business, baptist, company, ...

Other kinds of tasks: polarity, relevance, interest, topic clustering/classification.

Financial Market Price Data

50 stocks in the S&P 100, 2000-2011. 15 stocks reconstruct the price changes (30% error):

AA: Alcoa, AMGN: Amgen, AVP: Avon, BHI: Baker Hughes, WMB: Williams Companies, DIS: Disney, C: Citigroup, AEP: American Electric Power, F: Ford, GD: General Dynamics, INTC: Intel, IBM: IBM, MCD: McDonald's, TXN: Texas Instruments, XRX: Xerox.

Optimal, 15 degrees of freedom (PCA): 24% error.

Throwing Out Unnecessary Features is Good

Sparsity: represent your solution using only a few features.
- Sparse solutions generalize better out-of-sample (less overfitting).
- Sparse solutions are easier to interpret (few important features).
- Computations are more efficient.

Problem (NP-hard): find the few relevant features quickly.
Question: are those features provably good?

Three Learning Problems

1. Data summarization/compression (PCA)
2. Categorization/clustering (k-means)
3. Prediction (regression)

Data

Data matrix X (n data points, d dimensions):

    name   age     debt    income   hair    weight    sex
    John   21 yrs  $10K    $65K     black   175 lbs   M
    Joe    74 yrs  $100K   $25K     blonde  275 lbs   M
    Jane   27 yrs  $20K    $85K     blonde  135 lbs   F
    ...
    Jen    37 yrs  $400K   $105K    brun    155 lbs   F

Response matrix Y, with columns such as credit?, limit, risk (e.g. $2K/high, 0/$10K/low, ..., $15K/high).

X ∈ R^{n×d}, Y ∈ R^{n×ω}.

More Beautiful Data

X ∈ R^{231×174}, Y ∈ R^{231×166}.

[Two plots: % reconstruction error (0-15%) versus number of principal components (0-60), one for X and one for Y.]

PCA, K-Means, Linear Regression

[Illustration of the three problems on the data matrix, with k = 20 and r = 2k.]

Sparsity

Represent your solution using only a few features. Example: linear regression, Xw = y, where y is an optimal linear combination of only a few columns of X (sparse regression; regularization, ||w||_0 ≤ k; feature subset selection; ...).
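
As a concrete, hedged sketch of the feature-subset-selection view (not the talk's algorithm), the following Python/numpy function greedily picks k columns and refits least squares on them; X, y and k are placeholders.

    import numpy as np

    def greedy_sparse_regression(X, y, k):
        # Greedily pick k columns of X and fit least squares on them:
        # a simple stand-in for the ||w||_0 <= k constraint.
        n, d = X.shape
        selected, residual = [], y.copy()
        for _ in range(k):
            candidates = [j for j in range(d) if j not in selected]
            # Score each unused column by its correlation with the current residual.
            scores = [abs(X[:, j] @ residual) / (np.linalg.norm(X[:, j]) + 1e-12)
                      for j in candidates]
            selected.append(candidates[int(np.argmax(scores))])
            # Refit on the selected columns and update the residual.
            w_s, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
            residual = y - X[:, selected] @ w_s
        w = np.zeros(d)
        w[selected] = w_s
        return w, selected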

Singular Value Decomposition (SVD)

X = U Σ V^t, with U (n×d), Σ (d×d), V^t (d×d); computing it costs O(nd^2). Splitting off the top k singular directions,

    X = [U_k  U_{d-k}] [Σ_k 0; 0 Σ_{d-k}] [V_k^t; V_{d-k}^t],    X_k = U_k Σ_k V_k^t = X V_k V_k^t.

X_k is the best rank-k approximation to X: a reconstruction of X using only a few degrees of freedom (illustrated by image reconstructions X, X_20, X_40, X_60). V_k is an orthonormal basis for the best k-dimensional subspace of the row space of X.
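
A minimal numpy illustration of X_k = X V_k V_k^t (the matrix X and the rank k below are arbitrary placeholders):

    import numpy as np

    def best_rank_k(X, k):
        # Best rank-k approximation of X via the full SVD (cost O(n d^2)).
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Vk = Vt[:k].T                  # top-k right singular vectors, d x k
        return X @ Vk @ Vk.T           # X_k = X V_k V_k^t

    X = np.random.randn(200, 50)
    X20 = best_rank_k(X, 20)
    print(np.linalg.matrix_rank(X20), np.linalg.norm(X - X20, 'fro'))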

Fast Approximate SVD

1: Z = XR, with R ~ N(0,1)^{d×r}, so Z ∈ R^{n×r}
2: Q = qr.factorize(Z)
3: V̂_k = svd_k(Q^t X)

Theorem. Let r = k(1 + 1/ε) and E = X − X V̂_k V̂_k^t. Then E[||E||] ≤ (1 + ε) ||X − X_k||, and the running time is O(ndk) = o(SVD). [BDM, FOCS 2011]

We can construct V_k quickly.
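
A hedged numpy sketch of these three steps (Gaussian sketch, QR, small SVD); the sketch size r = k(1 + 1/ε) follows the theorem, but variable names and defaults are illustrative, not the paper's implementation.

    import numpy as np

    def fast_approx_Vk(X, k, eps=1.0, seed=0):
        # Approximate the top-k right singular vectors of X in roughly O(ndk) time.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        r = int(np.ceil(k * (1 + 1.0 / eps)))    # sketch size r = k(1 + 1/eps)
        R = rng.standard_normal((d, r))          # step 1: Z = X R, R ~ N(0,1)^{d x r}
        Z = X @ R
        Q, _ = np.linalg.qr(Z)                   # step 2: orthonormal basis for range(Z)
        _, _, Wt = np.linalg.svd(Q.T @ X, full_matrices=False)
        return Wt[:k].T                          # step 3: Vhat_k, top-k right singular vectors of Q^t X

    X = np.random.randn(1000, 100)
    Vk_hat = fast_approx_Vk(X, k=10)
    err = np.linalg.norm(X - X @ Vk_hat @ Vk_hat.T, 'fro')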

V_k and Sparsity

Important dimensions of V_k^t are important for X. Sampling r columns s_1, s_2, ..., s_r of V_k^t gives V̂_k^t ∈ R^{k×r}; the sampled r columns are good if

    I = V_k^t V_k ≈ V̂_k^t V̂_k.

Sampling schemes: largest norm (Jolliffe, 1972); randomized norm sampling (Rudelson, 1999; Rudelson-Vershynin, 2007); greedy (Batson et al., 2009; BDM, 2011).

V_k is important.
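
A small numpy sketch of the randomized norm-sampling idea (the greedy and deterministic schemes cited above are more involved); the rescaling by 1/sqrt(r p_i) is the standard trick that makes the sample unbiased, and the names here are illustrative.

    import numpy as np

    def sample_Vkt_columns(Vkt, r, seed=0):
        # Randomized norm sampling of r columns of V_k^t (a k x d matrix),
        # rescaled by 1/sqrt(r * p_i) so that E[Vhat Vhat^t] = V_k^t V_k = I_k.
        rng = np.random.default_rng(seed)
        k, d = Vkt.shape
        p = np.sum(Vkt**2, axis=0) / np.sum(Vkt**2)       # column sampling probabilities
        idx = rng.choice(d, size=r, replace=True, p=p)
        Vkt_hat = Vkt[:, idx] / np.sqrt(r * p[idx])        # k x r rescaled sample
        # How far the sample is from satisfying I ~ Vhat Vhat^t.
        gap = np.linalg.norm(np.eye(k) - Vkt_hat @ Vkt_hat.T, 2)
        return idx, gap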

Sparse PCA Algorithm

1: Choose a few columns C of X; C ∈ R^{n×r}.
2: Find the best rank-k approximation of X in the span of C, X_{C,k}.
3: Compute the rank-k SVD of X_{C,k} = U_{C,k} Σ_{C,k} V_{C,k}^t.
4: Z = X V_{C,k}.

Each feature in Z is a mixture of only the few (r) original feature dimensions in C.

    ||X − X V_{C,k} V_{C,k}^t|| ≤ ||X − X_{C,k} V_{C,k} V_{C,k}^t|| = ||X − X_{C,k}|| ≤ (1 + O(2k/r)) ||X − X_k||.   [BDM, FOCS 2011]
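
A hedged numpy sketch of steps 1-4; the column choice in step 1 is a simple norm-based placeholder here, not the selection rule from [BDM, FOCS 2011].

    import numpy as np

    def sparse_pca_features(X, k, r, cols=None):
        # Step 1: choose r columns C of X (placeholder: largest column norms).
        if cols is None:
            cols = np.argsort(-np.linalg.norm(X, axis=0))[:r]
        C = X[:, cols]                                    # C in R^{n x r}
        # Step 2: best rank-k approximation of X with columns in span(C).
        Q, _ = np.linalg.qr(C)
        Ub, sb, Vbt = np.linalg.svd(Q.T @ X, full_matrices=False)
        X_Ck = Q @ (Ub[:, :k] * sb[:k]) @ Vbt[:k]
        # Step 3: rank-k SVD of X_{C,k}; keep its top-k right singular vectors.
        _, _, Vt_Ck = np.linalg.svd(X_Ck, full_matrices=False)
        V_Ck = Vt_Ck[:k].T
        # Step 4: k new features, each (approximately) a mixture of the r columns in C.
        Z = X @ V_Ck
        return Z, V_Ck, cols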

Sparse PCA

[Image reconstructions for k = 20, 40, 60: dense PCA versus sparse PCA with r = 2k.]

Theorem. One can construct, in o(SVD) time, k features that are r-sparse, with r = O(k), and that are as good as the exact dense top-k PCA features.

Clustering: K-Means

[Illustration: full (slow) versus fast, sparse clustering, shown for 3 clusters and 4 clusters.]

Theorem. There is a subset of features of size O(#clusters) which produces nearly the optimal partition (within a constant factor). One can quickly produce features with a log-approximation factor. [BDM, 2013]
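
As a hedged illustration of clustering on a small feature subset, here is a plain Lloyd's-iteration k-means in numpy, run on r columns chosen by a simple norm heuristic (not the theorem's construction); all sizes are placeholders.

    import numpy as np

    def kmeans(X, n_clusters, n_iter=50, seed=0):
        # Plain Lloyd's algorithm: returns labels and the within-cluster cost.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), n_clusters, replace=False)]
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d2.argmin(1)
            centers = np.array([X[labels == c].mean(0) if np.any(labels == c) else centers[c]
                                for c in range(n_clusters)])
        return labels, ((X - centers[labels]) ** 2).sum()

    X = np.random.randn(300, 100)
    r = 8                                                # O(#clusters) features, illustrative
    cols = np.argsort(-np.linalg.norm(X, axis=0))[:r]    # placeholder feature selection
    labels_sparse, _ = kmeans(X[:, cols], n_clusters=4)
    # Evaluate the induced partition in the full feature space.
    full_cost = sum(((X[labels_sparse == c] - X[labels_sparse == c].mean(0)) ** 2).sum()
                    for c in range(4) if np.any(labels_sparse == c))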

Regression using Few Important Features

[Illustration: PCA regression (slow, dense) versus sparse, fast regression, for k = 20 and k = 40.]

Theorem. One can find O(k) pure features which perform as well as top-k PCA regression (additive error controlled by ||X − X_k||_F / σ_k). [BDM, 2013]
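
A small numpy comparison of top-k PCA regression with least squares on O(k) actual columns ("pure features"); the synthetic data and the correlation-based column choice are placeholders, not the selection from [BDM, 2013].

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 80))
    y = X[:, :5] @ rng.standard_normal(5) + 0.1 * rng.standard_normal(500)
    k = 10

    # Top-k PCA regression: project onto the top-k principal components, then least squares.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Zk = X @ Vt[:k].T
    w_pca, *_ = np.linalg.lstsq(Zk, y, rcond=None)
    err_pca = np.linalg.norm(y - Zk @ w_pca)

    # Regression on O(k) pure features (actual columns of X); column choice is a placeholder.
    cols = np.argsort(-np.abs(X.T @ y))[:2 * k]
    w_cols, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    err_cols = np.linalg.norm(y - X[:, cols] @ w_cols)
    print(err_pca, err_cols)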

The Proofs

All the algorithms use the sparsifier of V_k^t in [BDM, FOCS 2011]:
1. Choose columns of V_k^t so as to preserve its singular values.
2. Ensure that the selected columns preserve the structural properties of the objective with respect to the sampled columns of X.
3. Use dual-set sparsification algorithms to accomplish (2).

THANKS!

Covered: data compression (PCA, unsupervised); k-means clustering; supervised PCA regression. Also: support vector machines; coresets (subsets of the data instead of the features).

Few features: easy to interpret; better generalizers; faster computations.