Non-negative Matrix Factorization (NMF) in Semi-supervised Learning: Reducing Dimension and Maintaining Meaning
Kenneth K. Lopiano, SAMSI, 10 May 2013
Outline
- Introduction to NMF
- Applications
- Motivations
- NMF as a middle step in a semi-supervised learning framework
- Support vector machines
- Random forests
- Future directions and Q&A
Introduction
Consider a matrix $X \in \mathbb{R}^{p \times n}$ with $X_{ij} \ge 0$ for all $i, j$.
Non-standard interpretation for statisticians: rows are features, columns are samples.
Non-negative matrix factorization:
$$X_{p \times n} = W_{p \times k} H_{k \times n} + E_{p \times n}$$
- $W \in \mathbb{R}_+^{p \times k}$: basis matrix
- $H \in \mathbb{R}_+^{k \times n}$: coefficient matrix
- $E$: error matrix
Advantage: NMF decomposes the original matrix into a parts-based representation, which gives a better interpretation of the factor matrices for non-negative data.
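For concreteness, a minimal sketch of computing such a factorization with scikit-learn; the matrix dimensions and rank are placeholders, not the data from this talk.

```python
# Minimal sketch of an NMF decomposition with scikit-learn
# (illustrative only; X and k are placeholders).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((50, 200))          # p = 50 features, n = 200 samples, entries >= 0

k = 8                              # rank of the factorization
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)         # basis matrix, p x k
H = model.components_              # coefficient matrix, k x n
E = X - W @ H                      # error matrix
print(np.linalg.norm(E, "fro"))    # reconstruction error ||X - WH||_F
```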
Better Interpretation: Lee and Seung, 1999
Why do we care? Many real-world applications, some in health care!
1. Text mining: document clustering, topic detection, and trend tracking
2. Image analysis: feature representation, sparse coding, video tracking, image compression, image reconstruction, semi-supervised learning
3. Social/interaction networks: community detection, recommendation systems
4. Bioinformatics: -omics data analysis
5. Acoustic signal processing: blind source separation
6. Data clustering... to name a few!
Community Detection
H can be interpreted as indicating community membership. Illustrated here using a cell phone network of 177 cell towers in the Dominican Republic: a 177 x 177 matrix of normalized call flows (i.e., the ij-th element is the proportion of calls from i to j).
[Figure: map of the cell towers; x = longitude, y = latitude.]
Community Detection
Clear separation in the capital city: the west is higher income, the east is lower income.
[Figure: map of the capital city with tower communities; x = longitude, y = latitude.]
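A sketch of how community membership can be read off H, assuming a nonnegative matrix A of normalized call flows; the data here are random placeholders, not the analysis behind the maps. Each tower is assigned to the community with the largest coefficient in its column of H.

```python
# Sketch: reading community membership off the coefficient matrix H
# (A is placeholder data, not the actual call-flow matrix).
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
n = 177
A = rng.random((n, n))
A = A / A.sum(axis=1, keepdims=True)    # row i sums to 1: proportions of calls from i

k = 4                                    # assumed number of communities
model = NMF(n_components=k, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(A)               # n x k
H = model.components_                    # k x n
communities = H.argmax(axis=0)           # tower j joins the community maximizing H[:, j]
print(np.bincount(communities))          # community sizes
```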
Metagenes: Brunet et al. (2004)
An efficient method for identification of distinct molecular patterns that provides a powerful method for class discovery.
Audio Source Separation: Battenberg and Wessel (2009)
The input matrix is (number of positive frequency bins) x (number of analysis frames).
CUDA implementation: "the newer GeForce GTX 280, with 30 multiprocessors at 1.3 GHz, runs the CUDA implementation over 30x faster than the optimized Matlab implementation."
So it matters - now what?
Can I estimate W and H? How is this done? What are the properties of my estimators?
Developing algorithms to answer these questions has been a fruitful area of NMF research; however, I am not interested in improving or comparing algorithms. I am interested in using the algorithms in different applications and in understanding the unique benefits of NMF.
Loss Functions
For completeness, a brief review.
Frobenius norm:
$$\min_{W,H} \|X - WH\|_F^2 \quad \text{s.t. } W, H \ge 0$$
KL divergence:
$$\min_{W,H} \sum_{i,j} \left( X_{ij} \log \frac{X_{ij}}{(WH)_{ij}} - X_{ij} + (WH)_{ij} \right) \quad \text{s.t. } W, H \ge 0$$
Sparsity constraints on H (similarly defined for sparsity on W):
$$\min_{W,H} \|X - WH\|_F^2 + \eta \|W\|_F^2 + \beta \sum_{j=1}^{n} \|H(:, j)\|_1^2 \quad \text{s.t. } W \ge 0,\ H \ge 0$$
...and many more.
Algorithms: for an overview, see http://www.stanford.edu/group/mmds/slides2012/s-park.pdf
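As one concrete example, a minimal NumPy sketch of the classic multiplicative updates for the Frobenius loss above; eps guards against division by zero, and convergence checks are omitted. This is an illustration, not a production implementation.

```python
# Minimal sketch of multiplicative updates for min ||X - WH||_F^2
# with W, H >= 0 (no convergence checks or restarts).
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-10, seed=0):
    rng = np.random.default_rng(seed)
    p, n = X.shape
    W = rng.random((p, k))
    H = rng.random((k, n))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H; stays nonnegative
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W; stays nonnegative
    return W, H

X = np.abs(np.random.default_rng(2).standard_normal((30, 100)))
W, H = nmf_multiplicative(X, k=5)
print(np.linalg.norm(X - W @ H, "fro"))        # should decrease with n_iter
```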
NMF for Partially Labeled Data
NMF is an unsupervised learning algorithm for reducing the dimension of the original data.
Question: Suppose some observations are labeled (e.g., diseased versus not diseased). If the weight vectors are used as covariates in a statistical learning framework, does NMF give any clear advantages over other dimension-reduction techniques (e.g., PCA)?
Semi-Supervised Dimensionality Reduction
NMF is an unsupervised learning algorithm for reducing the dimension of the original data.
Goal: Incorporate information from labeled examples to estimate the rank of the lower-dimensional data.
Example: MNIST data, comparing 4s and 9s. n = 13782 observations, p = 784 pixels, m = 11791 labeled observations, and Y_i, i = 1, ..., m, indicates whether the i-th observation is a 4 or a 9.
- Use NMF to project the 13782 observations from 784 dimensions down to k = 8, 16, 32, 64, 128, and 256 dimensions (multiplicative-updates algorithm).
- Use support vector machines to train a classifier on the full training data (11791 observations, 784 covariates) and on the reduced training data (11791 observations, k dimensions).
- Predict the class membership of the n - m = 1991 validation observations (see the sketch below).
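A sketch of this pipeline with scikit-learn, on small placeholder data rather than the actual MNIST arrays (samples as rows here): NMF is fit on all observations, labeled and unlabeled, then an SVM is trained on the labeled coefficients.

```python
# Sketch of the NMF -> SVM pipeline (placeholder data, not MNIST).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_all = rng.random((600, 784))             # stand-in for the 13782 x 784 pixel matrix
y = rng.integers(0, 2, size=600)           # stand-in labels: 4 vs 9
m = 500                                    # first m observations are "labeled"

k = 32
nmf = NMF(n_components=k, init="nndsvda", max_iter=400, random_state=0)
C = nmf.fit_transform(X_all)               # coefficients for all observations, 600 x k
clf = SVC(kernel="rbf").fit(C[:m], y[:m])  # train the SVM on labeled coefficients
err = np.mean(clf.predict(C[m:]) != y[m:]) # error on the held-out validation rows
print(err)
```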
Results
[Figure: "Prediction Error in Reduced Dimension." Prediction error (%) versus k = 8, 16, 32, 64, 128, 256 for NMF and PC, with the full-dimension error shown for reference.]
Idea - Reducing Dimension and Maintaining Meaning
NMF gives factors that are more interpretable than those obtained from PCA or SVD. Does this mean that the importance of the variables in the reduced dimension can also be interpreted?
Random Forests and Variable Importance
Random forest: a machine learning algorithm used for classification and regression. Many decision trees are trained on resampled versions of the original data and combined to form the final classification rule (details omitted here).
Variable importance measures can be used to identify which variables are important for the learning task.
Gini impurity: $I = 1 - \sum_{i=1}^{2} f_i^2$, where $f_i$ is the fraction of items labeled $i$ in the set (two classes here).
Gini importance: every time a node is split on a variable, the Gini impurity of the two descendant nodes is less than that of the parent node. Adding up the Gini decreases for each individual variable over all trees in the forest gives a fast variable importance measure.
www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
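scikit-learn exposes this mean-decrease-in-impurity measure as feature_importances_; a sketch on placeholder coefficient data (not the talk's actual analysis):

```python
# Sketch: Gini (mean-decrease-in-impurity) variable importance from a
# random forest, on placeholder data standing in for NMF coefficients.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
C = rng.random((500, 32))                          # e.g., k = 32 NMF coefficients
y = (C[:, 0] + 0.1 * rng.standard_normal(500) > 0.5).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(C, y)
imp = rf.feature_importances_                      # summed Gini decreases, normalized
top4 = np.argsort(imp)[::-1][:4]                   # the four most important factors
print(top4, imp[top4])
```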
Results: k = 32, 64, 128
Random forest using the k factors obtained through NMF and the first k principal components. The 4 most important variables are plotted for both NMF and PCA.
Examples
[Figure: an example 4, the mean 4, an example 9, and the mean 9.]
Results
[Figures: the four most important variables under NMF and PCA, shown for k = 32, 64, and 128 together and then for each k separately.]
Moving Forward
- NMF as a middle step, with classification or prediction as the final step
- More examples: genetics and medical imaging
- With prediction or classification in mind, choose k by minimizing mean squared prediction error via cross-validation (see the sketch after this list)
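One way the last point could look in code: a sketch that scores each candidate rank by cross-validated accuracy, on placeholder data, with NMF fit on all observations as in the setup above.

```python
# Sketch: choosing the rank k by cross-validated classification accuracy
# (placeholder data; not the talk's implementation).
import numpy as np
from sklearn.decomposition import NMF
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.random((400, 784))
y = rng.integers(0, 2, size=400)

scores = {}
for k in (8, 16, 32, 64):
    C = NMF(n_components=k, init="nndsvda", max_iter=300,
            random_state=0).fit_transform(X)          # coefficients at rank k
    scores[k] = cross_val_score(SVC(), C, y, cv=5).mean()

best_k = max(scores, key=scores.get)                  # rank with best CV accuracy
print(best_k, scores)
```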
References
Battenberg, E. and Wessel, D. (2009). Accelerating non-negative matrix factorization for audio source separation on multi-core and many-core architectures. Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR 2009).
Brunet, J.-P. et al. (2004). Metagenes and molecular pattern discovery using matrix factorization. PNAS, vol. 101, no. 12.
Jiang, X. et al. (2012). A non-negative matrix factorization framework for identifying modular patterns in metagenomic profile data. Journal of Mathematical Biology, vol. 64, pp. 697-711.
Kim, J. and Park, H. (2008). Sparse NMF for Clustering. http://www.cc.gatech.edu/~hpark/papers/gt-cse-08-01.pdf
Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, vol. 401, pp. 788-791.
Mazack, M. (2009). Non-Negative Matrix Factorization with Applications to Handwritten Digit Recognition. Working paper, University of Minnesota. http://mazack.org/papers/mazack_nmf_paper.pdf
Wang, F. et al. (2010). Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery.