Microarray Data Mining: Puce a ADN

Similar documents

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Introduction to Data Mining

Data Mining. Nonlinear Classification

Final Project Report

Social Media Mining. Data Mining Essentials

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Measuring gene expression (Microarrays) Ulf Leser

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Gene expression analysis. Ulf Leser and Karin Zimmermann

Knowledge Discovery and Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

An Introduction to Data Mining

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Knowledge Discovery and Data Mining

Chapter 6. The stacking ensemble approach

Row Quantile Normalisation of Microarrays

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

Basic Analysis of Microarray Data

Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Survey of clinical data mining applications on big data in health informatics

Introduction to Pattern Recognition

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Classification and Regression by randomforest

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Gene Expression Analysis

The Data Mining Process

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Data Mining Algorithms Part 1. Dejan Sarka

Azure Machine Learning, SQL Data Mining and R

Data Mining and Machine Learning in Bioinformatics

Data Mining - Evaluation of Classifiers

Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

How To Solve The Kd Cup 2010 Challenge

Supervised Learning (Big Data Analytics)

Introduction To Real Time Quantitative PCR (qpcr)

Quality Assessment of Exon and Gene Arrays

Materials and Methods. Blocking of Globin Reverse Transcription to Enhance Human Whole Blood Gene Expression Profiling

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

Knowledge Discovery and Data Mining

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

Maschinelles Lernen mit MATLAB

Model Selection. Introduction. Model Selection

A Review of Data Mining Techniques

diagnosis through Random

Model Combination. 24 Novembre 2009

Predictive Data modeling for health care: Comparative performance study of different prediction models

Supervised Feature Selection & Unsupervised Dimensionality Reduction

The Scientific Data Mining Process

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

How many of you have checked out the web site on protein-dna interactions?

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Environmental Remote Sensing GEOG 2021

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning, Data Mining, and Knowledge Discovery: An Introduction

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Chapter 12 Discovering New Knowledge Data Mining

2 Decision tree + Cross-validation with R (package rpart)

A leader in the development and application of information technology to prevent and treat disease.

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

Statistical issues in the analysis of microarray data

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Using multiple models: Bagging, Boosting, Ensembles, Forests

Real-time PCR: Understanding C t

A Survey on Intrusion Detection System with Data Mining Techniques

KEITH LEHNERT AND ERIC FRIEDRICH

Gene Expression Assays

Ensemble Learning of Colorectal Cancer Survival Rates

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Machine Learning with MATLAB David Willingham Application Engineer

Data Mining On Diabetics

International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8 August 2013

Data Mining Part 5. Prediction

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

An Overview of Knowledge Discovery Database and Data mining Techniques

Microsoft Azure Machine learning Algorithms

Medical Informatics II

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Data Mining Techniques for Prognosis in Pancreatic Cancer

Appendix 2 Molecular Biology Core Curriculum. Websites and Other Resources

Transcription:

Microarray Data Mining: Puce a ADN Recent Developments Gregory Piatetsky-Shapiro KDnuggets EGC 2005, Paris 2005 KDnuggets EGC 2005

Role of Gene Expression Cell Nucleus Chromosome Gene expression Protein Gene (mrna), single strand Gene (DNA) Graphics courtesy of the National Human Genome Research Institute 2

Gene Expression Cells are different because of differential gene expression. About 40% of human genes are expressed at one time. Gene is expressed by transcribing DNA exons into single-stranded mrna mrna is later translated into a protein Microarrays measure the level of mrna expression 3

Gene Expression Measurement mrna expression represents dynamic aspects of cell mrna expression can be measured with latest technology mrna is isolated and labeled with fluorescent protein mrna is hybridized to the target; level of hybridization corresponds to light emission which is measured with a laser 4

Affymetrix DNA Microarrays image courtesy of Affymetrix 5

Some Probes Match Match Probe T A C T C Gene A T G A G T T 6

Other Probes MisMatch Gene A T G A G MisMatch Probe T A T T C T T 7

Affymetrix Microarray Concept 1. mrna segments tagged with fluorescent chemical 2. Matches with complementary probes 3. Fluorescence measured with laser image courtesy of Affymetrix 8

Affymetrix Microarray Raw Image Gene Value D26528_at 193 D26561_cds1_at -70 D26561_cds2_at 144 D26561_cds3_at 33 D26579_at 318 D26598_at 1764 D26599_at 1537 D26600_at 1204 D28114_at 707 enlarged section of raw image Scanner raw data 9

Microarray Potential Applications New and better molecular diagnostics Jan 11, 2005: FDA approved Roche Diagnostic AmpliChip, based on Affymetrix technology New molecular targets for therapy few new drugs, large pipeline, Improved treatment outcome Partially depends on genetic signature Fundamental Biological Discovery finding and refining biological pathways Personalized medicine?! 10

Microarray Data Analysis Types Gene Selection Find genes for therapeutic targets (new drugs) Classification (Supervised) Identify disease Predict outcome / select best treatment Clustering (Unsupervised) Find new biological classes / refine existing ones Discovery of Associations and Pathways 11

Outline DNA and Biology Microarray Classification - Best practices Gene Set Analysis Synthetic microarray data sets 12

Microarray Data Analysis Challenges Biological, Process, and Other variation Model needs to be explainable to biologists Few records (samples), usually < 100 Many columns (genes), usually > 1,000 This is very likely to result in false positives, discoveries due to random noise 13

Data Mining from Small Data: Example Les Américains boivent beaucoup de vin et ont beaucoup de maladie de coeur Les Francais boivent beaucoup de vin et ont peu de maladie de coeur Les Anglais boivent beaucoup de biere et ont beaucoup de maladie de coeur Les Allemands boivent beaucoup de biere et ont peu de maladie de coeur 14

Conclusion Vous pouvez boire tout que vous vouliez C est de parler en anglais que cause la maladie de coeur 15

Same Data, Different Results Who is Right? Many examples (e.g. CAMDA conferences) where researchers analyzed the same data, and found different results and gene sets. Who is right? Good methodology is needed for avoiding errors for best results 16

Capturing Best Practices for Microarray Data Analysis Worked with SPSS and S. Ramaswamy (MIT / Whitehead Institute) to capture best practices. Implemented as SPSS Microarray CATs (Clementine Application Template). Also implemented using Weka and open-source software Capturing Best Practice for Microarray Gene Expression Data Analysis, G. Piatetsky- Shapiro, T. Khabaza, S. Ramaswamy, Proceedings of KDD-2003 www.kdnuggets.com/gpspubs/ 17

Best Practices Capture the complete process, from raw data to final results Gene (feature) selection inside cross-validation Randomization testing Robust classification algorithms Simple methods give good results Advanced methods can be better Wrapper approach for best gene subset selection Use bagging to improve accuracy Remove/relabel mislabeled or poorly differentiated samples 18

Observations Simplest approaches are most robust Advanced approaches can be more accurate Small increase in diagnostic accuracy (e.g. 90% to 98%) can greatly reduce rate of false positives (5-fold) 1 0.8 0.6 0.4 0.2 p(disease test=yes), disease rate=0.01 0 0 10 20 30 40 50 60 70 80 90 100 Accuracy 19

Microarray Classification Desired Features Robust in presence of false positives Stable under cross-validation Results understandable by biologists Return confidence/probability Fast enough 20

Popular Classification Methods Decision Trees/Rules Find smallest gene sets, but not robust poor performance Neural Nets - work well for reduced number of genes K-nearest neighbor good results for small number of genes, but no model Naïve Bayes simple, robust, but ignores gene interactions Support Vector Machines (SVM) Good accuracy, does own gene selection, but hard to understand Specialized methods, D/S/A (Dudoit), 21

Gene Reduction Improves Classification Accuracy - Why? Most learning algorithms look for non-linear combinations of features Can easily find spurious combinations given few records and many genes false positives problem Classification accuracy improves if we first reduce number of genes by a linear method e.g. T-values of mean difference 22

Feature selection approach Rank genes by measure & select top 100-200 T-test for Mean Difference= Signal to Noise (S2N) = Other methods Avg 1 Avg 2 2 / N 1 1 2 / N 2 2 Avg 1 Avg 2 1 2 23

Global Feature / Parameter Selection is wrong Gene data Class data Data Cleaning & Preparation Feature/Parameter Selection Train data Model Building Test data Evaluation Because the information is leaked via gene selection. Leads to overly optimistic classification results. 24

Global Feature Selection Bias Example: Data w 6000 features, ~ 100 samples Used 3 Weka algorithms: SVM, IB1, J48 Error with Global feature selection was 50% to 150% lower than using X-validation Average % increase in error from using Global Feature Selection (A) 200 150 100 50 0-50 5 10 20 30 50 Best Features IB1 J48 SVM 25

Microarray Classification Process Train data Data Cleaning & Preparation Gene data Class data Feature/Parameter Selection Model Building Test data Evaluation 26

Evaluation Evaluation on one test dataset is not accurate (for small datasets) Evaluation, including feature/parameter selection needs to be inside cross-validation (Validation Croisee) 27

Evaluation inside X-validation Gene Data T r a i n Microarray Classification Process Final Test 1 Final Model 1 Final Result 1 28

Evaluation inside X-validation, 2 Gene Data T r a i n Microarray Classification Process Final Test 2 Final Model 2 Final Result 2 29

Evaluation inside X-validation, N Gene Data Final Test N T r a i n Microarray Classification Process Final Model N Final Result N 30

Iterative Wrapper approach to selecting the best gene set Model with top 100 genes is not optimal Test models using 1,2,3,, 10, 20, 30, 40,..., N top genes with cross-validation. Gene selection: Simple: equal number of genes from each class advanced: best number from each class For randomized algorithms (e.g. neural nets), average 10 cross-validation runs 31

Best Gene Set: one X-validation run 28% Error Avg for 10-fold X-val 25% 23% 20% 18% 15% 13% 10% Single Cross-Validation run 8% 32

Best Gene Set 10 X-validation runs Select gene set with lowest combined Error good, but not optimal! Average, high and low error rate for all classes 33

Multi-class classification Simple: One model for all classes Advanced: Separate model for each class 34

Error rates for each class Error rate Genes per Class 35

Example: Pediatric Brain Tumor Data 92 samples, 5 classes (MED, EPD, JPA, MGL, RHB) from U. of Chicago Children s Hospital Photomicrographs of tumours (400x) 36

Single Neural Net Class MED MGL RHB EPD JPA *ALL* 1-Net Error Rate 2.1% 17% 24% 9% 19% 8.3% 37

Bagging Improves Results Bagging or simple voting of N different neural nets improves the accuracy 38

Bagging 100 Networks Class 1-Net Error Rate Bag Error rate Bag Avg Conf MED 2.1% 2% (0)* 98% MGL 17% 10% 83% RHB 24% 11% 76% EPD 9% 0 91% JPA 19% 0 81% *ALL* 8.3% 3% (2)* 92% Note: suspected error on one sample (labeled as MED but consistently classified as RHB) 39

Cross-validated prediction strength misclassified? MED: Strong predictions poorly differentiated 40

Outline DNA and Biology Microarray Classification - Best practices Gene Set Analysis Synthetic microarray data sets 41

Gene Sets Key to Power Number of genes is a problem for single gene analysis Analyzing gene sets increases statistical power Group genes statistically (prototypes, correlations) Group genes using biological knowledge (pathways, Gene Ontology, medical text, ) 42

Gene Set Analysis Gene Set Enrichment Mootha et al, Nature Genetics, July 2003 Question: Genes involved in oxidative phosphorylation (~100) vs. diabetes? No genes in OXPHOS group are individually significant (avg. ~20% decrease) 43

OXPHOS Gene set example Decrease is consistent for 89% of OXPHOS genes Mean expression of all genes (gray) and of OXPHOS genes (red) for people with DM2 (Type 2 diabetes mellitus) vs Normal. 44

Outline DNA and Biology Microarray Classification - Best practices Gene Set Analysis Synthetic microarray data sets 45

Synthetic Microarray Data Getting high accuracy is very important Current data has too few samples; the true labels may be unknown. What methods are best for given data and what is their estimated accuracy? Proposal: generate synthetic but realistic microarray data with KNOWN labels to evaluate algorithms under different conditions 46

First Step: Analysis of Existing Microarray Datasets Dataset Platform, Software Tissue Genes Samples Classes Golub train Affy HuGeneFL; MAS4 Blood 7070 38 2 Golub test Affy HuGeneFL; MAS4 Blood 7070 34 2 Mich Lung Affy HuGeneFL Lung 7070 96 3 GCM train Affy Hu6800 and Hu35KsubA Many 16004 144 14 GCM normal Affy Hu6800 and Hu35KsubA Many 16004 90 13 Bremer pp5 Affy HuGeneFL; Probe Profiler Brain 7070 92 5 Brain U133 Affy U133; Probe Profiler Brain 22215 33 4 Different Technologies, Tissues, Normal/Abnormal 47

Global Distribution of Gene Means Analyzed means of ALL genes across samples Excluded Mean < 1 (mismatch, random) Found: log(means) ~ normal distribution Gene means are globally lognormal 48

Global Gene Means (Q-Q plot) Plot: log10(means) (blue) vs normal quantiles (red) 49

Gene Expression Standard Deviation vs Mean Globally log(sd) ~ a log(means) + b 50

True Rank vs T-value Rank Simulated 100 microarray datasets, with 6000 genes and 30 samples (using R) First 10 genes up-regulated by UP factor 51

Future Research Directions Why is gene distribution log normal? Develop a synthetic microarray data generator Which gene selection methods are most robust and under what conditions? Evaluate algorithm accuracy for a given test set by creating similar synthetic sets with known answers 52

Acknowledgements Eric Bremer, Children s Hospital (Chicago) & Northwestern U. Tom Khabaza, SPSS Sridhar Ramaswamy, MIT/Broad Institute Pablo Tamayo, MIT/Broad Institute 53

SIGKDD Explorations Special Issue on Microarray Data Mining www.acm.org/sigkdd/explorations/issue5-2.htm 54

Merci! Further resources on Data Mining: www.kdnuggets.com Microarrays: www.kdnuggets.com/websites/microarray.html Data Mining Course (one semester) www.kdnuggets.com/dmcourse 55

Questions? 56