Convegno Calcolo ad Alte Prestazioni "Biocomputing"
Bio-molecular diagnosis through Random Subspace Ensembles of Learning Machines
Alberto Bertoni, Raffaella Folgieri, Giorgio Valentini
DSI - Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
{bertoni,folgieri,valentini}@dsi.unimi.it
http://homes.dsi.unimi.it/~valenti
Outline
- Bio-molecular diagnosis of tumors using machine learning methods
- Current approaches to automatic bio-molecular diagnosis
- Random Subspace (RS) ensembles: experimental results on a case study
- Combining feature selection and RS ensembles
- On-going work: RP-ensembles
Bio-molecular diagnosis of malignancies: motivations
- Traditional clinical diagnostic approaches may sometimes fail in detecting tumors (Alizadeh et al., 2001)
- Several results showed that bio-molecular analysis (e.g. gene expression profiling) may help to better characterize malignancies
- Information supporting both diagnosis and prognosis of malignancies at the bio-molecular level can be obtained from high-throughput biotechnologies (e.g. DNA microarrays)
Bio-molecular diagnosis of malignancies: current approaches
A huge amount of data is available from biotechnologies: the analysis and extraction of significant biological knowledge is a critical task.
Current approaches rely on statistical and machine learning methods (Golub et al., 1999; Furey et al., 2000; Ramaswamy et al., 2001; Dudoit et al., 2002; Lee & Lee, 2003; Weston et al., 2003; Dettling et al., 2003; Dettling, 2004; Zhou et al., 2005; Zhang et al., 2006).
Main problems with gene expression data for bio-molecular diagnosis
- High dimensionality
- Low cardinality
- Curse of dimensionality
- Data are usually noisy: gene expression measurement errors, labeling errors
Current approaches against the curse of dimensionality
- Selection of significant subsets of components (genes), e.g. filter methods, forward selection, backward selection, recursive feature elimination, entropy- and mutual-information-based feature selection methods (see Guyon & Elisseeff, 2003 for a review)
- Extraction of significant subsets of features, e.g. Principal Component Analysis or Independent Component Analysis
However, both approaches have drawbacks...
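As a toy illustration of the feature-extraction route mentioned above, the sketch below projects expression profiles onto their top-k principal components with a plain SVD. The function name and interface are our own, not part of the talk:

```python
import numpy as np

def pca_extract(X, k):
    """Feature extraction via PCA: project the samples (rows of X)
    onto the top-k principal components of the centered data,
    computed with a thin SVD."""
    Xc = X - X.mean(axis=0)                       # center each gene
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # shape (n_samples, k)
```

Because the singular values are returned in decreasing order, the extracted components are sorted by decreasing variance, so the first few capture most of the expression variability.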
An alternative approach based on ensemble methods
Random subspace (RS) ensembles (Ho, 1998) reduce the high dimensionality of the data by randomly selecting subsets of genes. Aggregating different base learners trained on different subsets of features may reduce variance and improve diversity.
[Diagram: the data set D is projected onto random subspaces D_1, ..., D_m; the learning algorithm trains base hypotheses h_1, ..., h_m, whose outputs are aggregated into the final hypothesis h.]
The RS algorithm
Input:
- a d-dimensional labelled gene expression data set D
- a learning algorithm L
- the subspace dimension n < d
- the number of base learners I
Output:
- the final hypothesis h_ran : X -> C computed by the ensemble
begin
  for i = 1 to I
  begin
    D_i = Subspace_projection(D, n)
    h_i = L(D_i)
  end
  h_ran(x) = argmax_{t in C} card({i | h_i(x) = t})
end
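The algorithm above can be sketched in a few lines of Python. A nearest-centroid classifier stands in for the linear-SVM base learners used in the actual experiments, and all names are illustrative, not from the talk's software:

```python
import numpy as np

def subspace_projection(rng, d, n):
    """Randomly select n of the d gene indices, without replacement."""
    return rng.choice(d, size=n, replace=False)

class NearestCentroid:
    """Toy base learner standing in for the linear SVMs of the slides."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]

def rs_ensemble_fit_predict(X_train, y_train, X_test, n, I, seed=0):
    """RS algorithm: train I base learners h_i on random n-dimensional
    subspaces D_i, then aggregate by majority vote (the argmax in h_ran)."""
    rng = np.random.default_rng(seed)
    d = X_train.shape[1]
    votes = []
    for _ in range(I):
        idx = subspace_projection(rng, d, n)                  # D_i = Subspace_projection(D, n)
        h = NearestCentroid().fit(X_train[:, idx], y_train)   # h_i = L(D_i)
        votes.append(h.predict(X_test[:, idx]))
    votes = np.array(votes)                                   # shape (I, n_test)
    classes = np.unique(y_train)
    counts = np.array([(votes == c).sum(axis=0) for c in classes])
    return classes[counts.argmax(axis=0)]                     # h_ran(x)
```

Note that each base learner only ever sees its own n columns, so training the I learners is embarrassingly parallel, which is what makes the cluster implementation mentioned in the conclusions straightforward.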
Reasons for applying RS ensembles to the bio-molecular diagnosis of tumors
- Gene expression data are usually very high dimensional, and RS ensembles reduce the dimensionality and are effective with high dimensional data (Skurichina and Duin, 2002)
- Co-regulated genes show correlated gene expression levels (Gasch and Eisen, 2002), and RS ensembles are effective with correlated sets of features (Bingham and Mannila, 2001)
- Random projections may improve the diversity between base learners
- Overall accuracy of the ensemble may be enhanced through aggregation techniques (at least w.r.t. the variance component of the error)
Colon adenocarcinoma diagnosis
Data (Alon et al., 1999): 62 samples (40 colon tumors, 22 normal colon samples), 2000 genes
Methods: RS ensembles with linear SVMs as base learners; single linear SVMs
Software: C++ NEURObjects library
Hardware: Avogadro, a cluster of dual-processor Xeon workstations (Arlandini, 2005)
Results: colon tumor prediction (5-fold cross-validation)
Colon tumor prediction: error as a function of the subspace dimension (the single SVM test error is shown as a baseline)
Average base learner error
The better accuracy of the RS ensemble does not simply depend on the better accuracy of its component base learners.
Open problems with RS methods
1. Can we explain the effectiveness of RS through the diversity of the base learners?
2. Can we get a bias-variance interpretation?
3. What is the optimal subspace dimension?
4. Are feature selection and random subspace ensembles alternative approaches, or may it be useful to combine them?
Combining feature selection and random subspace ensemble methods
Random Subspace on Selected Features (RS-SF algorithm), a two-step algorithm:
1. Select a subset of features (genes) according to a suitable feature selection method
2. Apply the random subspace ensemble method to the subset of selected features
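The two steps can be sketched as follows; a Golub-style signal-to-noise score stands in for "a suitable feature selection method", and a nearest-centroid rule stands in for the linear-SVM base learner, so both choices are illustrative assumptions rather than the talk's actual setup:

```python
import numpy as np

def snr_score(X, y):
    """Signal-to-noise ratio of each gene for a two-class problem
    (a typical filter-style selection criterion; choice assumed)."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    s0, s1 = X[y == 0].std(axis=0), X[y == 1].std(axis=0)
    return np.abs(m0 - m1) / (s0 + s1 + 1e-12)

def rs_sf(X, y, X_test, n_selected, n_sub, I, seed=0):
    """RS-SF sketch: (1) keep the n_selected top-scoring genes,
    (2) run a random subspace ensemble on the reduced matrix."""
    keep = np.argsort(snr_score(X, y))[::-1][:n_selected]     # step 1
    Xs, Xt = X[:, keep], X_test[:, keep]
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X_test))
    for _ in range(I):                                        # step 2
        idx = rng.choice(n_selected, size=n_sub, replace=False)
        c0 = Xs[y == 0][:, idx].mean(axis=0)                  # class centroids
        c1 = Xs[y == 1][:, idx].mean(axis=0)
        d0 = np.linalg.norm(Xt[:, idx] - c0, axis=1)
        d1 = np.linalg.norm(Xt[:, idx] - c1, axis=1)
        votes += (d1 < d0)                                    # vote for class 1
    return (votes > I / 2).astype(int)                        # majority vote
```

The point of step 1 is that the random subspaces are then drawn from genes that are all at least weakly informative, instead of from the full, mostly irrelevant gene set.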
Results on combining feature selection with random subspace ensembles
Colon data set (Alon et al., 1999), 5-fold cross-validation
Comparison with other methods
Colon data set: generalization error estimated through cross-validation or multiple hold-out techniques

Method                                      Estimated error
LogitBoost (Dettling and Buhlmann, 2003)    0.1914
Bagging (Valentini et al., 2004)            0.1286
BagBoost (Dettling, 2004)                   0.1610
Random Forest (Breiman, 2001)               0.1486
Random Subspace                             0.0968
SVM                                         0.1129
PAM (Tibshirani et al., 2002)               0.1190
DLDA (Dudoit et al., 2002)                  0.1286
kNN                                         0.1638
An on-going development: Supervised Randomly Projected Ensembles (RP-ensembles)
Recent work on unsupervised analysis of complex bio-molecular data (Bertoni and Valentini, 2006) showed that random projections obeying the Johnson-Lindenstrauss lemma can be used for:
- discovering structures in bio-molecular data
- validating clustering results
- improving clustering results
Can random projections to lower-dimensional subspaces also be applied to supervised analysis (e.g. bio-molecular diagnosis)?
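A Johnson-Lindenstrauss-type projection differs from random subspace selection in that it mixes all genes through a random matrix rather than picking coordinates. A minimal sketch, assuming the common Gaussian construction (the functions are our own, not from Bertoni and Valentini, 2006):

```python
import numpy as np

def jl_dim(n_points, eps):
    """Target dimension suggested by the Johnson-Lindenstrauss lemma for
    preserving pairwise distances among n_points up to a (1 +/- eps) factor:
    k >= 4 ln(n) / (eps^2/2 - eps^3/3)."""
    return int(np.ceil(4 * np.log(n_points) / (eps ** 2 / 2 - eps ** 3 / 3)))

def random_projection(X, k, seed=0):
    """Project the rows of X into k dimensions with a Gaussian random
    matrix scaled by 1/sqrt(k), a standard JL-type projection."""
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0, size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R
```

Since pairwise distances are approximately preserved, distance-based base learners trained on independently drawn projections are natural candidates for the supervised RP-ensembles the slide asks about.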
Conclusions
- RS ensembles can improve the accuracy of bio-molecular diagnosis characterized by very high dimensional data
- They could also be easily applied to heterogeneous bio-molecular and clinical data
- A promising new approach consists in combining state-of-the-art feature (gene) selection methods with RS ensembles
- RS ensembles are computationally intensive, but can be easily parallelized using clusters of workstations (e.g. in an MPI framework)