Non-negative Matrix Factorization (NMF) in Semi-supervised Learning: Reducing Dimension and Maintaining Meaning

Transcription

1 Non-negative Matrix Factorization (NMF) in Semi-supervised Learning: Reducing Dimension and Maintaining Meaning. SAMSI, 10 May 2013

2 Outline: Introduction to NMF; Applications; Motivations; NMF as a middle step in a semi-supervised learning framework; Support vector machines; Random forests; Future directions and Q&A

3 Introduction. Consider a matrix $X_{p \times n}$ s.t. $X_{i,j} \ge 0 \ \forall i, j$. Non-standard interpretation for statisticians: rows are features, columns are samples. Non-negative matrix factorization: $X_{p \times n} = W_{p \times k} H_{k \times n} + E_{p \times n}$, where $W \in \mathbb{R}_+^{p \times k}$ is the basis matrix, $H \in \mathbb{R}_+^{k \times n}$ is the coefficient matrix, and $E$ is the error matrix. Advantage: NMF decomposes the original matrix into a parts-based representation that gives a better interpretation of the factor matrices for non-negative data.
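
To make the dimensions on this slide concrete, here is a minimal NumPy sketch. The sizes p, n, k and the random matrices are illustrative assumptions and do not come from the talk. It only shows the features-in-rows convention and how the error matrix E relates to X, W and H.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 6, 10, 2            # illustrative sizes (assumed, not from the talk)

W = rng.random((p, k))        # basis matrix, non-negative, p x k
H = rng.random((k, n))        # coefficient matrix, non-negative, k x n

# Non-negative data with features in rows and samples in columns.
X = W @ H + 0.01 * rng.random((p, n))

E = X - W @ H                 # error matrix, p x n

# Sample j (column j of X) is approximated by a non-negative
# combination of the k basis columns of W, weighted by H[:, j].
j = 0
approx_j = W @ H[:, j]
```

Because W and H are non-negative, each sample is built by adding basis columns together and never by cancelling them, which is what the slide means by a parts-based representation.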

4 Better Interpretation: Lee and Seung, 1999

5 Why do we care? Many real world applications - some in health care! 1. Text mining - document clustering, topic detection and trend tracking 2. Image analysis - feature representation, sparse coding, video tracking, image compression, image reconstruction, semi-supervised learning 3. Social/Interaction networks - community detection, recommendation systems 4. Bioinformatics - -omics data analysis 5. Acoustic Signal Processing - blind source separation 6. Data clustering...to name a few!

6 Community Detection. H can be interpreted as indicating community membership. Illustrated here using a cell phone network of 177 cell towers in DR; matrix of normalized call flows (i.e., ij-th element = proportion of calls from i to j). [Figure: map with axes x and y] Kenneth K. Lopiano

7 Community Detection. Clear separation in the capital city: the west is higher income, the east is lower income. [Figure: map with axes x and y]
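
The community-membership reading of H in the Community Detection slides can be sketched as follows. This is a hedged illustration: the function names `normalize_call_flows` and `assign_communities`, the toy counts, and the toy H are all hypothetical. Row normalization is one assumed reading of "proportion of calls from i to j", and hard assignment by largest coefficient is one common convention, not necessarily the one used in the talk.

```python
import numpy as np

def normalize_call_flows(counts):
    """Row-normalize call counts so entry (i, j) is the share of
    tower i's outgoing calls that go to tower j (assumed convention)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Rows with no calls stay all-zero instead of dividing by zero.
    return np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

def assign_communities(H):
    """Assign each column (tower) to the factor with the largest coefficient."""
    return np.argmax(H, axis=0)

# Toy call counts between 3 towers (made up for illustration).
counts = np.array([[0, 8, 2],
                   [6, 0, 0],
                   [1, 1, 0]])
X = normalize_call_flows(counts)

# Hypothetical k x n coefficient matrix, as if returned by an NMF of X.
H = np.array([[0.9, 0.7, 0.1],
              [0.1, 0.2, 0.8]])
labels = assign_communities(H)
```

In this toy case the first two towers fall in one community and the third in another. With real data, H would come from factorizing the normalized flow matrix.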

8 Metagenomics: Brunet et al. An efficient method for identifying distinct molecular patterns that also provides a powerful method for class discovery.

9 Audio Source Separation: Battenberg and Wessel. Matrix: number of positive frequency bins by number of analysis frames. CUDA implementation: ...the newer Geforce GTX 280, with 30 multiprocessors at 1.3GHz, runs the CUDA implementation over 30x faster than the optimized Matlab implementation.

10 So it matters - now what? Can I estimate W and H? How is this done? What are the properties of my estimators? Developing algorithms to answer these questions has been a fruitful area of NMF research. However, I am not interested in improving or comparing algorithms. I am interested in using the algorithms in different applications and understanding the unique benefits of NMF.

11 Loss Functions. For completeness, a brief review. Frobenius norm: $\min_{W,H} \|X - WH\|_F^2$ s.t. $W, H \ge 0$. KL divergence: $\min_{W,H} \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right)$ s.t. $W, H \ge 0$. Sparsity constraints on H (similarly defined for sparsity on W): $\min_{W,H} \|X - WH\|_F^2 + \eta \|W\|_F^2 + \beta \sum_{j=1}^{n} \|H(:, j)\|_1^2$ s.t. $W \ge 0$, $H \ge 0$. Many more...
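
The three objectives on this slide can be evaluated directly for a given W and H. The sketch below is a plain NumPy transcription under stated assumptions: the function names are hypothetical, and the small `eps` inside the logarithm is an added numerical guard that is not part of the slide's formulas.

```python
import numpy as np

def frobenius_loss(X, W, H):
    """Squared Frobenius norm of the residual X - WH."""
    return np.sum((X - W @ H) ** 2)

def kl_loss(X, W, H, eps=1e-12):
    """Generalized KL divergence between X and WH.
    eps guards against log(0) and division by zero (an added assumption)."""
    WH = W @ H
    return np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH)

def sparse_loss(X, W, H, eta, beta):
    """Frobenius loss plus a ridge penalty on W and a squared
    L1 penalty on each column of H."""
    col_l1 = np.sum(np.abs(H), axis=0)      # ||H(:, j)||_1 for each j
    return frobenius_loss(X, W, H) + eta * np.sum(W ** 2) + beta * np.sum(col_l1 ** 2)

# Tiny worked example where X equals WH exactly.
W = np.array([[1.0, 0.0],
              [0.0, 1.0]])
H = np.array([[1.0, 2.0],
              [3.0, 4.0]])
X = W @ H

fro = frobenius_loss(X, W, H)               # exact fit, so 0
kl = kl_loss(X, W, H)                       # exact fit, so 0
sp = sparse_loss(X, W, H, eta=0.5, beta=0.1)
```

With an exact fit only the penalties remain in the sparse objective: 0.5 * 2 for W plus 0.1 * (4^2 + 6^2) for the columns of H, which gives 6.2.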

12 Algorithms

13 NMF for Partially Labeled Data. NMF is an unsupervised learning algorithm that reduces the dimension of the original data. Question: Suppose some observations are labeled (e.g., diseased versus not diseased). If the weight vectors are used as covariates in a statistical learning framework, does NMF give any clear advantages over other dimension reduction techniques (e.g., PCA)?

14 Semi-Supervised Dimensionality Reduction. NMF is an unsupervised learning algorithm that reduces the dimension of the original data. Goal: incorporate information from labeled examples to estimate the rank of the lower-dimensional data. Example: MNIST data, comparing 4s and 9s. $n = 13782$, $p = 784$ pixels, $m = 11791$; $Y_i$, $i = 1, \ldots, m$, indicates whether the i-th observation is a 4 or a 9. Use NMF to project the observations from 784 dimensions to k = 8, 16, 32, 64, 128, and 256 dimensions (multiplicative updates algorithm). Use support vector machines to train a classifier using (a) the full training data: m observations and 784 covariates; (b) the reduced training data: m observations and k dimensions. Predict the class membership of the $n - m = 1991$ validation observations.
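
The slide names the multiplicative updates algorithm, so here is a minimal sketch of the standard Lee and Seung multiplicative updates for the Frobenius loss. Several things are assumptions rather than details from the talk: the function names `nmf_multiplicative` and `project_onto_basis`, the iteration count, the `eps` guard, the toy data sizes, and the idea of reducing validation observations by holding W fixed and updating only H.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate X (p x n, non-negative) by W (p x k) @ H (k x n)
    using multiplicative updates for the Frobenius loss."""
    rng = np.random.default_rng(seed)
    p, n = X.shape
    W = rng.random((p, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def project_onto_basis(X_new, W, n_iter=200, eps=1e-10, seed=0):
    """Reduce new samples to k dimensions with W held fixed
    (an assumed way to handle validation data)."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], X_new.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ X_new) / (W.T @ W @ H + eps)
    return H

# Toy stand-in for the pixel-by-observation matrix (not MNIST).
rng = np.random.default_rng(1)
X_train = rng.random((20, 3)) @ rng.random((3, 40))   # p = 20, m = 40
X_valid = rng.random((20, 3)) @ rng.random((3, 5))    # 5 validation samples

W, H_train = nmf_multiplicative(X_train, k=3)
H_valid = project_onto_basis(X_valid, W)
```

The columns of `H_train` (one k-vector per observation) would then play the role of the reduced covariates, and could be transposed and passed to any SVM implementation, with `H_valid` used for prediction.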

21 Results [Figure: Prediction Error in Reduced Dimension. y-axis: Prediction Error (%); x-axis: Dimension k; series: NMF, PC, Full]

22 Idea - Reducing Dimension and Maintaining Meaning. NMF gives factors that are more interpretable than those obtained from PCA or SVD. Does this mean that the importance of the variables in the reduced dimension can be interpreted?

23 Random Forests and Variable Importance. Random forest: a machine learning algorithm used for classification and regression. Many decision trees are trained on many samples of the original data and combined to form final classification rules (details omitted here). Variable importance measures can be used to identify which variables are important for the learning task. Gini impurity: $I = 1 - \sum_{i=1}^{2} f_i^2$, where $f_i$ = fraction of items labeled i in the set. Gini importance: Every time a split of a node is made on a variable the gini impurity criterion for the two descendent nodes is less than the parent node. Adding up the gini decreases for each individual variable over all trees in the forest gives a fast variable importance... Source: breiman/randomforests/cc home.htm
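
The Gini quantities on this slide are easy to compute by hand for a single split. The sketch below uses hypothetical function names and toy labels. Weighting each child node by its share of the parent's items is an assumption (a common convention), since the slide only says the decreases are added up.

```python
from collections import Counter

def gini_impurity(labels):
    """I = 1 - sum_i f_i^2, where f_i is the fraction of items with label i."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its
    two child nodes (weighting by node size is an assumed convention)."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
                      + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

# Toy node holding three 4s and three 9s.
parent = [4, 4, 4, 9, 9, 9]

perfect = gini_decrease(parent, [4, 4, 4], [9, 9, 9])   # children are pure
mixed = gini_decrease(parent, [4, 4, 9], [4, 9, 9])     # children still mixed
```

A variable's Gini importance would then be the sum of such decreases over every split made on that variable, across all trees in the forest.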

24 Results, k = 32, 64, 128. Random forest using the k factors obtained through NMF and the first k principal components. The 4 most important variables are plotted for both NMF and PCA.

25 Examples [Figure panels: Example 4, Mean 4, Example 9, Mean 9]

26 Results [Figure panels, each labeled k = 32, 64, and 128]

27 Results k = 32 [Figure panels, each labeled k = 32]

28 Results k = 64 [Figure panels, each labeled k = 64]

29 Results k = 128 [Figure panels, each labeled k = 128]

30 Moving forward. NMF as a middle step, with classification or prediction as the final step. More examples: genetics and medical imaging. With prediction or classification in mind: minimize mean squared prediction error and use cross validation to choose k.

31 References
Battenberg, E. and Wessel, D. (2009) Accelerating non-negative matrix factorization for audio source separation on multi-core and many core architectures. Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009).
Brunet et al. (2004) Metagenes and molecular pattern discovery using matrix factorization. PNAS, vol 101, no 12.
Jiang, X. et al. (2012) A non-negative matrix factorization framework for identifying modular patterns in metagenomic profile data. Journal of Mathematical Biology, vol 64, pp
Kim, J. and Park, H. (2008) Sparse NMF for Clustering. hpark/papers/gt-cse pdf
Mazack, M. (2009) Non-Negative Matrix Factorization with Applications to Handwritten Digit Recognition. Working Paper, University of Minnesota. nmf paper.pdf
Wang, F. et al. (2010) Community discovery using NMF. Data Mining and Knowledge Discovery.