
Decision Trees and Random Forests. Reference: Leo Breiman, http://www.stat.berkeley.edu/~breiman/randomforests 1. Decision trees Example (Geurts, Fillet, et al., Bioinformatics 2005): Patients to be classified: normal vs. diseased

Decision trees Classification of biomarker data: large number of values (e.g., microarray or mass spectrometry analysis of biological sample)

Decision trees Mass spectrometry parameters or gene expression parameters (around 15,000 values). Given a new patient with biomarker data, is s/he normal or ill?

Decision trees Needed: selection of the relevant variables from the many available. The number n of known examples in the dataset D = {(x_i, y_i), i = 1, ..., n} is small (characteristic of data mining problems). Assume we have for each biological sample a feature vector x, and we will classify it: diseased: y = 1; normal: y = -1. Goal: find a function f(x) ≈ y which predicts y from x.

Decision trees How to estimate the error of f(x) and avoid over-fitting the small dataset D? Use cross-validation: test the predictor f(x) on a part of the sample D not used to build it. For a biological sample, the feature vector x = (x_1, ..., x_d) consists of features (or biomarkers, or attributes) x_i = A_i describing the biological sample from which x is obtained.

The decision tree approach Decision tree approach to finding a predictor f(x) ≈ y based on the data set D: form a tree whose nodes are attributes A_i in x; decide which attributes A_i to look at first in predicting y from x, i.e., find those with the highest information gain and place these at the top of the tree; then use recursion to form sub-trees based on the attributes not used in the higher nodes.

The decision tree approach Advantages: interpretable, easy to use, scalable, robust

Decision tree example Example 1 (Moore): UCI data repository (http://www.ics.uci.edu/~mlearn/mlrepository.html) MPG ratings of cars. Goal: predict the MPG rating of a car from a set of attributes A_i

Decision tree example Examples (each row is attribute set for a sample car): R. Quinlan

Decision tree example Simple assessment of information gain: how much does a particular attribute A_i help to classify a car with respect to MPG?
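
A small sketch of such an assessment in Python (the toy car records and attribute names are made up for illustration and are not the UCI data):

    # Information gain of an attribute: entropy of the class labels before the split,
    # minus the weighted entropy of the labels within each value of the attribute.
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def information_gain(records, attribute, target="mpg"):
        labels = [r[target] for r in records]
        gain = entropy(labels)
        for value in {r[attribute] for r in records}:
            subset = [r[target] for r in records if r[attribute] == value]
            gain -= len(subset) / len(records) * entropy(subset)
        return gain

    # Toy records: cylinders turns out to be more informative about mpg than maker.
    cars = [
        {"cylinders": 4, "maker": "asia",    "mpg": "good"},
        {"cylinders": 4, "maker": "europe",  "mpg": "good"},
        {"cylinders": 8, "maker": "america", "mpg": "bad"},
        {"cylinders": 8, "maker": "america", "mpg": "bad"},
        {"cylinders": 6, "maker": "asia",    "mpg": "bad"},
    ]
    print(information_gain(cars, "cylinders"), information_gain(cars, "maker"))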

Decision tree example Begin the decision tree: start with the most informative

Decision tree example criterion, cylinders:

Decision tree example Recursion: build next level of tree. Initially have: Now build sub-trees: split each set of cylinder numbers into

Decision tree example further groups. Resulting next level:

Decision tree example Final tree:

Decision tree example

Decision tree example Points: Don't split a node if all records have the same value (e.g., cylinders = 6). Don't split a node if it cannot have more than one child (e.g., acceleration = medium).

Pseudocode: Program Tree(Input, Output) If all output values are the same, then return a leaf (terminal) node which predicts the unique output. If the input values are balanced in a leaf node (e.g., 1 good, 1 bad for acceleration), then return a leaf predicting the majority of the outputs at the same level (e.g., bad in this case). Else find the attribute A_i with the highest information gain. If the attribute A_i at the current node has m values, then return an internal (non-leaf) node with m children; build each child by calling Tree(NewIn, NewOut), where NewIn = the records in the dataset consistent with the corresponding value of A_i and with all the values above this node.
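
A runnable sketch of this recursion (pure Python, repeating the entropy and gain helpers so it is self-contained; records are assumed to be dictionaries mapping attribute names to values, with the class label stored under a separate key):

    # Recursive decision-tree induction on categorical attributes (ID3-style sketch).
    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gain(records, attr, target):
        labels = [r[target] for r in records]
        g = entropy(labels)
        for v in {r[attr] for r in records}:
            sub = [r[target] for r in records if r[attr] == v]
            g -= len(sub) / len(records) * entropy(sub)
        return g

    def build_tree(records, attributes, target):
        labels = [r[target] for r in records]
        if len(set(labels)) == 1 or not attributes:        # unique output, or nothing left to split on
            return Counter(labels).most_common(1)[0][0]    # leaf: predict the majority output
        best = max(attributes, key=lambda a: gain(records, a, target))
        children = {}
        for v in {r[best] for r in records}:               # one child per value of the best attribute
            subset = [r for r in records if r[best] == v]
            children[v] = build_tree(subset, [a for a in attributes if a != best], target)
        return (best, children)                            # internal node

    # Usage with records like the toy cars above:
    #   tree = build_tree(cars, ["cylinders", "maker"], target="mpg")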

Another decision tree: prediction of wealth from census data (Moore):

Prediction of age from census:

Prediction of gender from census: A. Moore

2. Important point: always cross-validate It is important to test your model on new data (test data) different from the data used to train the model (training data); this is cross-validation. A cross-validation error of 2% is good; 40% is poor.
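
A quick illustration in Python (scikit-learn is assumed; the synthetic dataset, the decision-tree learner, and the choice of 5 folds are arbitrary):

    # Estimate out-of-sample error by 5-fold cross-validation rather than training error.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=200, n_features=20, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print("cross-validated error:", 1 - scores.mean())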

3. Background: mass spectroscopy What does a mass spectrometer do? 1. It measures masses of molecules better than any other technique. 2. It can give information about chemical structures of molecules.

Mass spectroscopy How does it work? 1. Takes an unknown molecule M and adds i protons to it, giving it charge i (forming the ion MH_i^{i+}). 2. Accelerates the ion MH_i^{i+} in a known electric field E. 3. Measures the time of flight along a known distance D. 4. The time of flight T grows with the mass m of the ion and decreases with its charge i (for a linear TOF analyzer, T is proportional to sqrt(m/i)).

Mass spectroscopy Thus the flight time T determines the mass-to-charge ratio m/i, conventionally written m/z (z = charge). With a large number of molecules in a biosample, this gives a spectrum of m/z values, which allows identification of the molecules in the sample (here IgG = immunoglobulin G)
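
As a rough numerical illustration of this relation, the standard linear-TOF formula T = L * sqrt(m / (2 z e U)) can be evaluated in Python; the drift length L and accelerating voltage U below are assumed instrument parameters, not values from these notes:

    # Illustrative time-of-flight calculation: heavier ions (larger m/z) arrive later.
    import math

    E_CHARGE = 1.602e-19      # elementary charge (C)
    AMU = 1.66054e-27         # atomic mass unit (kg)

    def flight_time(mz, drift_length=1.0, voltage=20e3):
        """Flight time (s) of an ion with mass-to-charge ratio mz (Da per charge),
        for an assumed drift length (m) and accelerating voltage (V)."""
        m_over_q = mz * AMU / E_CHARGE                      # kg per coulomb
        return drift_length * math.sqrt(m_over_q / (2.0 * voltage))

    # Compare a small peptide, trypsinogen-sized protein, and IgG-sized protein.
    for mz in (1000, 24000, 150000):
        print(mz, "Da/charge ->", flight_time(mz) * 1e6, "microseconds")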

Mass spectroscopy

Mass spectroscopy What are the measurements good for? To identify, verify, and quantify: metabolites, proteins, oligonucleotides, drug candidates, peptides, synthetic organic chemicals, polymers. Applications of mass spectrometry: biomolecule characterization (proteins and peptides, oligonucleotides), pharmaceutical analysis.

Mass spectroscopy How does a mass spectrometer work?

Mass spectroscopy [Source: Sandler Mass Spectroscopy]

Mass spectroscopy Two types of ionization: 1. Electrospray ionization (ESI):

Mass spectroscopy

Mass spectroscopy [above, MH_i^{i+} denotes the molecule with i protons (H+) attached]

Mass spectroscopy 2. MALDI:

Mass spectroscopy Mass analyzers separate ions based on their mass-to-charge ratio (m/z). They operate under high vacuum and measure the m/z ratio of the ions.

Mass spectroscopy Components: 1. Quadrupole mass analyzer (filter): uses a combination of RF and DC voltages to operate as a mass filter before the masses are accelerated. Has four parallel metal rods. Lets one mass pass through at a time. Can scan through all masses or allow only one fixed mass.

Mass spectroscopy

Mass spectroscopy 2. Time-of-flight (TOF) Mass Analyzer Accelerates ions with electric field, detects them, analyzes flight time.

Mass spectroscopy

Mass spectroscopy Ions are formed in pulses. The drift region is field free. Measures the time for ions to reach the detector. Small ions reach the detector before large ones. Ion trap mass analyzer:

Mass spectroscopy

Mass spectroscopy 3. Detector: Ions are detected with a microchannel plate:

Mass spectroscopy Mass spectrum shows the results: ESI Spectrum of Trypsinogen (MW 23983)

Mass spectroscopy

Mass spectroscopy 4. Dimensional reduction (G. Izmirlian): Sometimes we perform a dimension reduction by reducing the mass spectrum information of human subject i to store only peaks:

Mass spectroscopy

Mass spectroscopy Then we have the (compressed) peak information in the feature vector x = (x_1, ..., x_d), with x_k = location of the k-th mass spectrum peak (above a fixed threshold). Compressed or not, the outcome value for the feature vector x_i of subject i is y_i = ±1.
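
A minimal sketch of this peak-based compression (NumPy and SciPy's find_peaks are assumed; the m/z grid, the random stand-in spectrum, and the threshold are placeholders):

    # Reduce a full spectrum to the m/z locations of peaks above a fixed threshold.
    import numpy as np
    from scipy.signal import find_peaks

    def peak_features(mz_grid, intensities, threshold):
        """Return the m/z locations of local maxima whose intensity exceeds threshold."""
        peak_idx, _ = find_peaks(intensities, height=threshold)
        return mz_grid[peak_idx]

    # Toy usage: a spectrum of ~15,000 intensity values becomes a short feature vector.
    mz_grid = np.linspace(1000, 20000, 15000)
    intensities = np.random.rand(15000)        # stand-in for a real spectrum
    x = peak_features(mz_grid, intensities, threshold=0.999)
    print(len(x), "peaks kept")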

5. Random forest example Example (Geurts, et al.): Normal/sick dichotomy for RA and for IBD: we now build a forest of decision trees based on differing attributes in the nodes:

Random forest application For example: Could use mass spectroscopy data to determine disease state

Random forest application Mass spec segregates proteins and other molecules through a spectrum of m/z ratios (m = mass; z = charge).

Random forest application Geurts, et al.

Random Forests: Advantages: accurate, easy to use (Breiman software), fast, robust. Disadvantages: difficult to interpret. More generally: how do we combine the results of different predictors (e.g., decision trees)? Random forests are examples of ensemble methods, which combine the predictions of weak classifiers p_i(x).

Ensemble methods: observations 1. Boosting: As seen earlier, take a linear combination of the predictions p_i(x) of the classifiers i (assume these are decision trees):
f(x) = sum_i a_i p_i(x),   (1)
where p_i(x) = 1 if the i-th tree predicts illness and p_i(x) = -1 otherwise, and predict y = 1 if f(x) > 0 and y = -1 if f(x) < 0.
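
A minimal sketch of this weighted vote (NumPy is assumed; the per-tree predictions and the weights a_i below are placeholders, and how the a_i are learned, e.g., by AdaBoost, is not shown):

    # Combine +/-1 predictions of several trees with weights a_i, as in (1).
    import numpy as np

    def ensemble_predict(tree_preds, weights):
        """tree_preds: array of shape (n_trees, n_samples) with entries +1/-1.
        weights: array of shape (n_trees,). Returns +1/-1 predictions per sample."""
        f = weights @ tree_preds            # f(x) = sum_i a_i p_i(x) for each sample
        return np.where(f > 0, 1, -1)

    # Toy usage with 3 trees and 4 samples.
    p = np.array([[ 1, -1,  1,  1],
                  [ 1,  1, -1,  1],
                  [-1, -1,  1,  1]])
    a = np.array([0.5, 0.3, 0.2])
    print(ensemble_predict(p, a))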

Ensemble methods: observations 2. Bagging: Take a vote: majority rules (equivalent in this case to setting a_i = 1 for all i in (1) above). An example of a bagging algorithm is the random forest, where a forest of decision trees takes a vote.
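
A minimal bagging sketch (scikit-learn and NumPy are assumed, as are a feature matrix X and labels y in {-1, +1}; the number of trees is an arbitrary choice):

    # Bagging: train each tree on a bootstrap sample, then take a majority vote.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagged_trees(X, y, n_trees=100, random_state=0):
        rng = np.random.default_rng(random_state)
        trees = []
        n = len(X)
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
            trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return trees

    def majority_vote(trees, X):
        votes = np.array([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
        return np.where(votes.sum(axis=0) > 0, 1, -1)     # majority rule for labels in {-1, +1}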

General features of a random forest: The original feature vector x has d features A_1, ..., A_d. Each tree uses a random selection of m << d features {A_{i_1}, ..., A_{i_m}} chosen from all the features A_1, A_2, ..., A_d; the associated feature space is different for each tree and is denoted F_k, 1 ≤ k ≤ K = number of trees. (Often K is large; e.g., K = 500.) For each split in a tree node, the splitting variable A_i is chosen according to its information content.
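
To see these choices concretely, here is how they map onto scikit-learn's RandomForestClassifier (used purely as an illustration, not the Breiman software linked above; note that scikit-learn redraws the m candidate features at every split rather than once per tree, and the training arrays are assumed to exist):

    # Random forest: K = 500 trees, each split drawing m ~ sqrt(d) candidate features.
    from sklearn.ensemble import RandomForestClassifier

    rf = RandomForestClassifier(
        n_estimators=500,       # K, the number of trees in the forest
        max_features="sqrt",    # m, the size of the random feature subset per split
        criterion="entropy",    # Shannon entropy (the Gini index is the other option)
        random_state=0,
    )
    # rf.fit(X_train, y_train); rf.predict(X_test)   # X_train, y_train, X_test assumed available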

Information content in a node To compute the information content of a node: assume the input set to the node is S; then the information content of node N is

Information content in a node MÐRÑ œ lwl LÐWÑ lw l LÐW Ñ lw l LÐW Ñß P P V V where lwl œ input sample size; lwpßvl œ size of leftß right subclasses of W LÐWÑ œ Shannon entropy of W œ ": log : 3œ " 3 # 3 with : œ TÐG s lwñ œ proportion of class G in sample W. 3 3 3

Information content in a node [later we will use Gini index, another criterion] Thus LÐWÑ œ "variablity" or "lack of full information" in the probabilities : 3 forming sample W input into current node R. MÐRÑ œ "information from node R" Þ For each variable E3, average over all nodes R in all trees involving this variable to find average information content LavÐE3Ñof E3.

Information content in a node (a) Rank all variables A_i according to information content. (b) For each fixed n_1 (up to the total number of variables), use only the first n_1 variables; select the n_1 which minimizes the prediction error.
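
A sketch of this selection loop (scikit-learn is assumed; feature_importances_ stands in for the averaged information content, and the candidate values of n_1 and forest sizes are arbitrary):

    # Rank features by importance, then pick the prefix size n1 with the best CV score.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def select_n1(X, y, candidate_sizes=(10, 30, 100, 300)):
        ranking = np.argsort(RandomForestClassifier(n_estimators=200, random_state=0)
                             .fit(X, y).feature_importances_)[::-1]
        best = None
        for n1 in candidate_sizes:
            score = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                                    X[:, ranking[:n1]], y, cv=5).mean()
            if best is None or score > best[1]:
                best = (n1, score)
        return best   # (n1 giving the smallest prediction error, its CV accuracy)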

Information content in a node Geurts, et al.

Random forests: application Application to: early diagnosis of rheumatoid arthritis (RA); rapid diagnosis of inflammatory bowel diseases (IBD). 3 patient groups (University Hospital of Liège):

Random forests: application
                         RA    IBD
Disease patients         34     60
Negative controls        29     30
Inflammatory controls    40     30
Total                   103    120
Mass spectra obtained by SELDI-TOF mass spectrometry on chip arrays:

Random forests: application Hydrophobic (H4), weak cation exchange (CM10), strong anion exchange (Q10)

Random forests: application Feature vectors: each feature vector x consists of about 15,000 values in each case. Effective dimension reduction method: discretize horizontally and vertically to go from 15,000 to 300 variables
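
A minimal sketch of the horizontal step of such a discretization (binning the spectrum into 300 windows and keeping each window's maximum; the bin count and the use of the maximum are assumptions, since the notes do not give the exact scheme):

    # Horizontal discretization: collapse ~15,000 intensity values into 300 bins.
    import numpy as np

    def discretize(spectrum, n_bins=300):
        """Summarize each of n_bins consecutive windows of the spectrum by its maximum."""
        edges = np.linspace(0, len(spectrum), n_bins + 1, dtype=int)
        return np.array([spectrum[a:b].max() for a, b in zip(edges[:-1], edges[1:])])

    spectrum = np.random.rand(15000)        # stand-in for one SELDI-TOF spectrum
    x = discretize(spectrum)
    print(x.shape)                          # (300,)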

Random forests: application

Random forests: application Sensitivity and specificity:

Random forests: application Accuracy measures: DT = decision tree; RF = random forest; 5NN = 5-nearest neighbors. Note on sensitivity and specificity: use the confusion matrix
                      Actual condition
Test outcome          True      False
Positive               TP        FP
Negative               FN        TN

Random forests: application
Sensitivity = TP / (TP + FN) = TP / Total positives
Specificity = TN / (TN + FP) = TN / Total negatives
Positive predictive value = TP / (TP + FP) = TP / Total predicted positives
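
These definitions translate directly into code (a small sketch; the example counts are made up):

    # Sensitivity, specificity, and positive predictive value from confusion-matrix counts.
    def diagnostic_measures(tp, fp, fn, tn):
        return {
            "sensitivity": tp / (tp + fn),     # TP / total actual positives
            "specificity": tn / (tn + fp),     # TN / total actual negatives
            "ppv":         tp / (tp + fp),     # TP / total predicted positives
        }

    print(diagnostic_measures(tp=50, fp=5, fn=10, tn=55))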

Random forests: application Variable ranking on the IBD dataset: 10 most important variables in spectrum:

Random forests: application RF-based (tree ensemble) variable ranking vs. variable ranking by individual p-values:

Random forests: application RA IBD

6. RF software: Spider: http://www.kyb.tuebingen.mpg.de/bs/people/spider/whatisit.html Leo Breiman: http://www.stat.berkeley.edu/~breiman/randomforests/cc_software.htm WEKA machine learning software: http://www.cs.waikato.ac.nz/ml/weka/ http://en.wikipedia.org/wiki/weka_(machine_learning)