Non-negative Matrix Factorization (NMF) in Semi-supervised Learning: Reducing Dimension and Maintaining Meaning
Kenneth K. Lopiano, SAMSI, 10 May 2013
Outline
- Introduction to NMF
- Applications
- Motivations
- NMF as a middle step in a semi-supervised learning framework
- Support vector machines
- Random forests
- Future directions and Q&A
Introduction
Consider a matrix $X \in \mathbb{R}^{p \times n}$ with $X_{ij} \ge 0$ for all $i, j$.
Non-standard interpretation for statisticians: rows are features, columns are samples.
Non-negative matrix factorization:
$$X_{p \times n} = W_{p \times k} H_{k \times n} + E_{p \times n}$$
- $W \in \mathbb{R}_+^{p \times k}$: basis matrix
- $H \in \mathbb{R}_+^{k \times n}$: coefficient matrix
- $E$: error matrix
Advantage: NMF decomposes the original matrix into a parts-based representation, which gives a better interpretation of the factor matrices for non-negative data.
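For concreteness, a minimal sketch of computing such a factorization with scikit-learn; the matrix dimensions and rank are placeholders, not the data from this talk.

```python
# Minimal sketch of an NMF decomposition with scikit-learn
# (illustrative only; X and k are placeholders).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((50, 200))          # p = 50 features, n = 200 samples, entries >= 0

k = 8                              # rank of the factorization
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)         # basis matrix, p x k
H = model.components_              # coefficient matrix, k x n
E = X - W @ H                      # error matrix
print(np.linalg.norm(E, "fro"))    # reconstruction error ||X - WH||_F
```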
Better Interpretation: Lee and Seung, 1999
Why do we care? Many real-world applications, some in health care!
1. Text mining: document clustering, topic detection, and trend tracking
2. Image analysis: feature representation, sparse coding, video tracking, image compression, image reconstruction, semi-supervised learning
3. Social/interaction networks: community detection, recommendation systems
4. Bioinformatics: -omics data analysis
5. Acoustic signal processing: blind source separation
6. Data clustering... to name a few!
Community Detection
H can be interpreted as indicating community membership. Illustrated here using a cell phone network of 177 cell towers in the Dominican Republic: a 177 x 177 matrix of normalized call flows (i.e., the ij-th element is the proportion of calls from i to j).
[Figure: map of the cell towers; x = longitude, y = latitude.]
Community Detection
Clear separation in the capital city: the west is higher income, the east is lower income.
[Figure: map of the capital city with tower communities; x = longitude, y = latitude.]
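A sketch of how community membership can be read off H, assuming a nonnegative matrix A of normalized call flows; the data here are random placeholders, not the analysis behind the maps. Each tower is assigned to the community with the largest coefficient in its column of H.

```python
# Sketch: reading community membership off the coefficient matrix H
# (A is placeholder data, not the actual call-flow matrix).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n = 177
A = rng.random((n, n))
A = A / A.sum(axis=1, keepdims=True)    # row i sums to 1: proportions of calls from i

k = 4                                    # assumed number of communities
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(A)               # n x k
H = model.components_                    # k x n
communities = H.argmax(axis=0)           # tower j joins the community maximizing H[:, j]
print(np.bincount(communities))          # community sizes
```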
Metagenes: Brunet et al. (2004)
An efficient method for identification of distinct molecular patterns that provides a powerful method for class discovery.
Audio Source Separation: Battenberg and Wessel (2009)
The input matrix is (number of positive frequency bins) x (number of analysis frames).
CUDA implementation: "the newer GeForce GTX 280, with 30 multiprocessors at 1.3 GHz, runs the CUDA implementation over 30x faster than the optimized Matlab implementation."
So it matters - now what?
Can I estimate W and H? How is this done? What are the properties of my estimators?
Developing algorithms to answer these questions has been a fruitful area of NMF research; however, I am not interested in improving or comparing algorithms. I am interested in using the algorithms in different applications and in understanding the unique benefits of NMF.
Loss Functions
For completeness, a brief review.
Frobenius norm:
$$\min_{W,H} \|X - WH\|_F^2 \quad \text{s.t. } W, H \ge 0$$
KL divergence:
$$\min_{W,H} \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right) \quad \text{s.t. } W, H \ge 0$$
Sparsity constraints on H (similarly defined for sparsity on W):
$$\min_{W,H} \|X - WH\|_F^2 + \eta \|W\|_F^2 + \beta \sum_{j=1}^{n} \|H(:, j)\|_1^2 \quad \text{s.t. } W \ge 0,\ H \ge 0$$
...and many more.
Algorithms: for an overview, see http://www.stanford.edu/group/mmds/slides2012/s-park.pdf
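As one concrete example, a minimal NumPy sketch of the classic multiplicative updates for the Frobenius loss above; eps guards against division by zero, and convergence checks are omitted. This is an illustration, not a production implementation.

```python
# Minimal sketch of multiplicative updates for min ||X - WH||_F^2
# with W, H >= 0 (no convergence checks or restarts).
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    W = rng.random((p, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H; stays nonnegative
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W; stays nonnegative
    return W, H

X = np.abs(np.random.default_rng(2).standard_normal((30, 100)))
W, H = nmf_multiplicative(X, k=5)
print(np.linalg.norm(X - W @ H, "fro"))        # should decrease with n_iter
```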
NMF for Partially Labeled Data
NMF is an unsupervised learning algorithm for reducing the dimension of the original data.
Question: Suppose some observations are labeled (e.g., diseased versus not diseased). If the weight vectors are used as covariates in a statistical learning framework, does NMF give any clear advantages over other dimension-reduction techniques (e.g., PCA)?
Semi-Supervised Dimensionality Reduction
NMF is an unsupervised learning algorithm for reducing the dimension of the original data.
Goal: Incorporate information from labeled examples to estimate the rank of the lower-dimensional data.
Example: MNIST data, comparing 4s and 9s. n = 13782 observations, p = 784 pixels, m = 11791 labeled observations, and Y_i, i = 1, ..., m, indicates whether the i-th observation is a 4 or a 9.
- Use NMF to project the 13782 observations from 784 dimensions down to k = 8, 16, 32, 64, 128, and 256 dimensions (multiplicative-updates algorithm).
- Use support vector machines to train a classifier on the full training data (11791 observations, 784 covariates) and on the reduced training data (11791 observations, k dimensions).
- Predict the class membership of the n - m = 1991 validation observations (see the sketch below).
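A sketch of this pipeline with scikit-learn, on small placeholder data rather than the actual MNIST arrays (samples as rows here): NMF is fit on all observations, labeled and unlabeled, then an SVM is trained on the labeled coefficients.

```python
# Sketch of the NMF -> SVM pipeline (placeholder data, not MNIST).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_all = rng.random((600, 784))             # stand-in for the 13782 x 784 pixel matrix
y = rng.integers(0, 2, size=600)           # stand-in labels: 4 vs 9
m = 500                                    # first m observations are "labeled"

k = 32
nmf = NMF(n_components=k, init="nndsvda", max_iter=400, random_state=0)
C = nmf.fit_transform(X_all)               # coefficients for all observations, 600 x k
clf = SVC(kernel="rbf").fit(C[:m], y[:m])  # train the SVM on labeled coefficients
err = np.mean(clf.predict(C[m:]) != y[m:]) # error on the held-out validation rows
print(err)
```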
Results
[Figure: "Prediction Error in Reduced Dimension." Prediction error (%) versus k = 8, 16, 32, 64, 128, 256 for NMF and PC, with the full-dimension error shown for reference.]
Idea - Reducing Dimension and Maintaining Meaning
NMF gives factors that are more interpretable than those obtained from PCA or SVD. Does this mean that the importance of the variables in the reduced dimension can also be interpreted?
Random Forests and Variable Importance
Random forest: a machine learning algorithm used for classification and regression. Many decision trees are trained on resampled versions of the original data and combined to form the final classification rule (details omitted here).
Variable importance measures can be used to identify which variables are important for the learning task.
Gini impurity: $I = 1 - \sum_{i=1}^{2} f_i^2$, where $f_i$ is the fraction of items labeled $i$ in the set (two classes here).
Gini importance: every time a node is split on a variable, the Gini impurity of the two descendant nodes is less than that of the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable importance measure.
www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
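scikit-learn exposes this mean-decrease-in-impurity measure as feature_importances_; a sketch on placeholder coefficient data (not the talk's actual analysis):

```python
# Sketch: Gini (mean-decrease-in-impurity) variable importance from a
# random forest, on placeholder data standing in for NMF coefficients.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
C = rng.random((500, 32))                          # e.g., k = 32 NMF coefficients
y = (C[:, 0] + 0.1 * rng.standard_normal(500) > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(C, y)
imp = rf.feature_importances_                      # summed Gini decreases, normalized
top4 = np.argsort(imp)[::-1][:4]                   # the four most important factors
print(top4, imp[top4])
```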
Results: k = 32, 64, 128
Random forest using the k factors obtained through NMF and the first k principal components. The 4 most important variables are plotted for both NMF and PCA.
Examples
[Figure: an example 4, the mean 4, an example 9, and the mean 9.]
Results
[Figures: the four most important variables under NMF and PCA, shown for k = 32, 64, and 128 together and then for each k separately.]
Moving Forward
- NMF as a middle step, with classification or prediction as the final step
- More examples: genetics and medical imaging
- With prediction or classification in mind, choose k by minimizing mean squared prediction error via cross-validation (see the sketch after this list)
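One way the last point could look in code: a sketch that scores each candidate rank by cross-validated accuracy, on placeholder data, with NMF fit on all observations as in the setup above.

```python
# Sketch: choosing the rank k by cross-validated classification accuracy
# (placeholder data; not the talk's implementation).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.random((400, 784))
y = rng.integers(0, 2, size=400)

scores = {}
for k in (8, 16, 32, 64):
    C = NMF(n_components=k, init="nndsvda", max_iter=300,
            random_state=0).fit_transform(X)          # coefficients at rank k
    scores[k] = cross_val_score(SVC(), C, y, cv=5).mean()

best_k = max(scores, key=scores.get)                  # rank with best CV accuracy
print(best_k, scores)
```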
References
Battenberg, E. and Wessel, D. (2009). Accelerating non-negative matrix factorization for audio source separation on multi-core and many-core architectures. Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009).
Brunet, J.-P. et al. (2004). Metagenes and molecular pattern discovery using matrix factorization. PNAS, vol. 101, no. 12.
Jiang, X. et al. (2012). A non-negative matrix factorization framework for identifying modular patterns in metagenomic profile data. Journal of Mathematical Biology, vol. 64, pp. 697-711.
Kim, J. and Park, H. (2008). Sparse NMF for Clustering. http://www.cc.gatech.edu/~hpark/papers/gt-cse-08-01.pdf
Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, vol. 401, pp. 788-791.
Mazack, M. (2009). Non-Negative Matrix Factorization with Applications to Handwritten Digit Recognition. Working paper, University of Minnesota. http://mazack.org/papers/mazack_nmf_paper.pdf
Wang, F. et al. (2010). Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery.