Statistical Data Mining. Practical Assignment 3: Discriminant Analysis and Decision Trees




In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques. Load the workspace containing the R objects and functions for this assignment, PracticalObjects.RData.

Linear and Quadratic Discriminant Analysis: Cushing's Syndrome Data

In this section we consider the Cushing's syndrome data, which is available in the package MASS, and use the functions lda and qda to perform linear and quadratic discriminant analysis, respectively. These two functions are also in the package MASS.

    library(MASS)
    Cushings

1. Note that three types of Cushing's syndrome are distinguished, coded a, b, and c. We remove the observations with type u (unknown) from the dataset.

    cush <- Cushings[Cushings$Type != "u", ]
    cush[, 3] <- factor(cush[, 3], levels = c("a", "b", "c"))
    cush.type <- cush[, 3]

The function factor is used to create a vector cush.type that indicates the syndrome type of each observation in the reconstructed dataset cush. A log transform of the continuous variables ensures that their distributions are more symmetrical.

    boxplot(cush[, 1:2])
    cush[, 1] <- log(cush[, 1])
    cush[, 2] <- log(cush[, 2])
    boxplot(cush[, 1:2])

2. First, we perform a linear discriminant analysis on the Cushing's data. The function lda accepts the continuous variables as its first argument and the class labels as its second argument.

    cush.lda <- lda(cush[, 1:2], cush.type)
    cush.lda

The command cush.lda displays some of the discriminant analysis results. Note that the linear discriminants, a1 and a2, are displayed.
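As a standalone illustration of where those discriminant coefficients live in the fitted object, here is a sketch using the built-in iris data (not the assignment data, so the snippet runs on its own):

```r
# Sketch: LDA on the built-in iris data, used here only so the
# example is self-contained.
library(MASS)

iris.lda <- lda(iris[, 1:4], iris$Species)

# The "scaling" component holds the linear discriminant
# coefficients (the vectors a1 and a2 referred to above):
# one column per discriminant, one row per input variable.
iris.lda$scaling
```

With three classes and four continuous variables, at most two linear discriminants exist, so scaling is a 4 x 2 matrix.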

We use the method predict to project observations onto the linear discriminants. The argument dimen specifies the number of discriminant components onto which the data will be projected.

    cush.pred <- predict(cush.lda, dimen = 2)

After the object cush.pred is created we extract the scores (projected observations) on the first dimen discriminant variables.

    cush.pr <- cush.pred$x

Let us look at the data projected onto the first two discriminant components.

    eqscplot(cush.pr, type = "n", xlab = "First linear discriminant",
             ylab = "Second linear discriminant", main = "LDA")
    text(cush.pr, labels = as.character(cush.type), cex = 0.8)

3. We can create a plot that shows the linear discriminant analysis decision boundaries for the dataset cush[, 1:2]. The function dec.bound.plot, in the workspace PracticalObjects, is used to create such a plot. This function only accepts datasets with two continuous variables and one categorical (or nominal) variable that indicates the group or class of each observation. The first argument of this function is the object created by lda (or qda), and the second and third arguments are the data (continuous variables) and the class labels, respectively.

    dec.bound.plot(cush.lda, cush[, 1:2], cush.type,
                   "Tetrahydrocortisone", "Pregnanetriol", "LDA")

Examining the plot, it is clear that the fourth, fifth, and sixth arguments of dec.bound.plot are the x-axis label, the y-axis label, and the title of the plot, respectively.

4. The function qda can be used to perform a quadratic discriminant analysis, and dec.bound.plot enables us to visualise the quadratic decision boundaries.

    cush.qda <- qda(cush[, 1:2], cush.type)
    dec.bound.plot(cush.qda, cush[, 1:2], cush.type,
                   "Tetrahydrocortisone", "Pregnanetriol", "QDA")

5. The function visu.lqda, also in the workspace PracticalObjects, can be used to display the estimated Gaussian distributions for each class in addition to the linear or quadratic decision boundaries.
For LDA the orientation and spread of the estimated densities are similar for each class.

    visu.lqda(cush.lda, cush)

Compare this to the spread and orientation of the estimated Gaussian distributions for QDA. Are there any differences?

    visu.lqda(cush.qda, cush)

Note that the second argument of the function visu.lqda is the full dataset (continuous variables and grouping factor).

6. Reduced-rank LDA does not use all projection directions. For the Cushing's data we can use a maximum of two projection directions. The function visu.rr1.lda can be used to fit a reduced-rank LDA with rank 1 to the cush dataset, and to visualise the decision boundaries.

    visu.rr1.lda(cush.lda, cush)

The solid black lines indicate the decision boundaries, while the coloured dashed lines indicate the projection of the data points onto one dimension.

Linear and Quadratic Discriminant Analysis: Vanveer Data

We consider the breast tumour data again and perform discriminant analyses on different subsets of the genes. For linear discriminant analysis we use smaller subsets of the genes to ensure that the estimated covariance matrices are of full rank.

    vanv.10 <- vanveer.4000[, 1:11]
    vanv.20 <- vanveer.4000[, 1:21]
    vanv.prog <- vanveer.4000[, 1]

1. In this section we look more closely at the proportion of observations that are accurately classified by the LDA method. We perform an LDA on the dataset containing the subset of 10 best genes. As above, we use the function predict to extract information about the projection of the data onto the linear discriminant components.

    vanv.lda.10 <- lda(vanv.10[, 2:11], vanv.prog)
    vanv.pred.10 <- predict(vanv.lda.10)

Calculate the proportion of patients that are accurately classified. The function table counts the number of patients with prognosis poor that are classified as good or poor, and likewise for the patients with prognosis good.

    table(vanv.prog, vanv.pred.10$class)

How many patients are misclassified? We can adjust the above commands to see how many patients are misclassified when we use the subset of 20 best genes.

    vanv.lda.20 <- lda(vanv.20[, 2:21], vanv.prog)
    vanv.pred.20 <- predict(vanv.lda.20)
    table(vanv.prog, vanv.pred.20$class)

2. Note that we used the same data to fit the model and to assess its classification performance. A more accurate estimate of the misclassification rate can be obtained by dividing the dataset into a training set and a test set.
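A confusion table like the ones produced above can be turned into an accuracy and a misclassification count directly. A sketch with toy labels (hypothetical data, not the Vanveer genes):

```r
# Sketch: correct classifications sit on the diagonal of the
# confusion table, so accuracy is the diagonal sum divided by
# the total number of observations.
truth <- factor(c("good", "good", "poor", "poor", "poor"))
pred  <- factor(c("good", "poor", "poor", "poor", "good"))

tab <- table(truth, pred)
accuracy      <- sum(diag(tab)) / sum(tab)
misclassified <- sum(tab) - sum(diag(tab))
accuracy       # 0.6
misclassified  # 2
```

The same two lines applied to table(vanv.prog, vanv.pred.10$class) answer the questions above.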
We randomly select patients for the training set (roughly 60% of the patients); the remaining patients constitute the test set.

    train <- runif(nrow(vanv.10)) < 0.60
    test <- !train

The vectors train and test are indicator vectors: they indicate which observations in the dataset vanv.10 belong to the training set and which belong to the test set.
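Note that runif puts only approximately 60% of the rows in the training set. If an exact 60/40 split is preferred, sample can be used instead. A sketch with an illustrative size and seed (neither is part of the assignment):

```r
# Sketch: an exact 60/40 split via sample() rather than runif().
set.seed(1)          # illustrative seed, for reproducibility
n <- 100             # illustrative number of observations
train.idx <- sample(n, size = round(0.6 * n))
train <- seq_len(n) %in% train.idx
test  <- !train

sum(train)          # exactly 60
all(train == !test) # TRUE: every row is in exactly one set
```

Either construction yields the complementary indicator vectors used below.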

We perform a linear discriminant analysis on the training set, and then predict the classes of the test-set observations from the fitted linear discriminant components. To calculate the number of misclassifications in the test set we compare the predicted classes of the test-set observations with their true classes.

    vanv.lda.10 <- lda(vanv.10[train, 2:11], vanv.prog[train])
    vanv.pred.10 <- predict(vanv.lda.10, vanv.10[test, 2:11])
    table(vanv.prog[test], vanv.pred.10$class)

How many patients in the test set are misclassified for the subset of 10 best genes?

Decision Trees: Cushing's Syndrome Data

In this section we look at the Cushing's syndrome data in the library MASS.

1. First we grow a full decision tree and examine the splits that are carried out. The function rpart in the package rpart is used to grow a tree. First we specify the arguments formula and data. The argument formula requires that we indicate the grouping factor on the left-hand side of the ~ and the other variables, which determine the grouping, on the right-hand side. The argument data indicates the data frame which contains the variables named in the formula. Note how we specify the formula: Type ~ . Here Type is the name of the variable that contains the grouping/class label of each observation, and the . indicates all the other variables in the data frame. Two control parameters are specified as well: minsplit and cp. The former indicates the minimum number of observations that must exist in a node in order for a split to be attempted, while the complexity parameter cp specifies that any split that does not decrease the overall lack of fit by a factor of cp should not be attempted. (Use the command ?rpart.control to look at the various parameters that control aspects of the rpart fit.)

    library(rpart)
    cush.tree <- rpart(Type ~ ., data = cush, cp = 0, minsplit = 3)
    summary(cush.tree)
    plot(cush.tree)
    text(cush.tree)

The function summary can be used to display details of the fitted tree.
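The same formula interface can be tried on the built-in iris data, so the call runs standalone (iris is not part of the assignment; the control values mirror those used above):

```r
# Sketch: rpart's formula interface on the built-in iris data.
# Species ~ . means "Species as a function of all other variables".
library(rpart)

iris.tree <- rpart(Species ~ ., data = iris,
                   minsplit = 3,  # allow splits on very small nodes
                   cp = 0)        # grow the full, unpruned tree
printcp(iris.tree)  # complexity table: one row per nested subtree
```

Setting cp = 0 grows the largest tree the data support; pruning, discussed next, then selects a subtree.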
The function plot.partition in the workspace PracticalObjects is used to visualise the partitioning of the Cushing's data in two dimensions. Note that this function can only be used to visualise datasets with two continuous variables (which explain the grouping).

    plot.partition(cush.tree, cush)

2. To prune the decision tree we can use cross-validation to pick a subtree. The function plotcp gives a visual representation of the cross-validation results of an rpart object. The vertical lines show the standard errors of the respective subtrees, while the horizontal line is one standard error worse than the best subtree. A good choice of the size of the subtree is often the leftmost value for which the mean lies below the horizontal line.

    plotcp(cush.tree)

Accordingly, we choose the subtree of size three, with a cp value of 0.15, for the Cushing's dataset. The function prune is used to prune the full decision tree to a subtree of a specific size. We have to specify the value of cp when we use the function prune to ensure that the tree is pruned to the desired extent.

    cush.prune <- prune(cush.tree, cp = 0.15)
    plot(cush.prune)
    text(cush.prune)

Look at the partitioning of the feature space that corresponds to this decision tree.

    plot.partition(cush.prune, cush)

How many data points are misclassified?

Decision Trees: Singh Data

In this section we use the prostate cancer data of Singh et al. (2002), discussed in Example 2, Section 3.3.1. We examine the performance of decision trees on gene expression data.

1. The full decision tree. The Singh dataset consists of a training set and a test set, singh.train and singh.test, respectively. These datasets are stored in the workspace PracticalObjects. Let us examine the Singh data. First, we can look at the names of the variables in the dataset by using the function dimnames. Second, it is also of interest to find the dimensions of each dataset. For this purpose we use the functions nrow and ncol, which give the number of rows and columns of a dataset, respectively.

    dimnames(singh.train)[[2]]
    nrow(singh.train)
    ncol(singh.train)
    nrow(singh.test)
    ncol(singh.test)

We note that the first variable in the Singh dataset, outcome, indicates whether a sample is a tumour or non-tumour sample. How are these two classes indicated?

    levels(singh.train[, 1])

A full tree is grown for the training data and plotted.

    singh.tree <- rpart(outcome ~ ., data = singh.train, cp = 0)
    plot(singh.tree)
    text(singh.tree)

How many genes are used to classify the data?

2. The performance of the tree is evaluated by calculating the training error and the test error. To determine the classification of each observation in the training set we use the function predict.

    singh.prob.pred <- predict(singh.tree, singh.train)
    singh.prob.pred

If the posterior probability for the class label normal is larger than 0.5, then an observation is classified as normal, and as tumour otherwise. Let us create a vector that indicates the classification of each observation in the training dataset. The function ifelse can be used to create such a vector.

    singh.class <- ifelse(singh.prob.pred[, 1] > 0.5, "normal", "tumour")
    singh.class

Read more about this function (?ifelse). Now we can compute the training error by comparing the vector singh.train$outcome (which contains the class labels for the training set) with the vector singh.class (containing the predicted classes). The training error is the number of misclassified observations divided by the total number of observations in the training set.

    singh.train.error <- sum(singh.class != singh.train$outcome) / nrow(singh.train)

Make sure you understand how we calculated the training error by breaking the above command down into smaller parts:

    misclass.vector <- singh.class != singh.train$outcome
    number.misclass <- sum(misclass.vector)
    number.obs.train.set <- nrow(singh.train)
    singh.train.error <- number.misclass / number.obs.train.set
    misclass.vector
    number.misclass
    singh.train.error

The test error can be calculated by predicting the classes of the test set first.

    singh.prob.pred <- predict(singh.tree, singh.test)

Exercises

1. LDA and QDA. Adjust the commands given in point ?? (LDA and QDA: Vanveer Data section) and calculate the proportion of misclassifications in the test set for the subsets of 15 and 20 best genes. Compare these proportions to the proportions of misclassifications calculated at point ??, when we used the whole dataset both to perform an LDA and to assess its performance. Do you notice any differences?

2.
We can use the same training set and test set as defined above (point ??) to assess the performance of quadratic discriminant analysis. The quadratic decision boundaries are more flexible than the linear decision boundaries (as seen in the plots constructed for the Cushing's data). This can create the expectation that QDA will lead to a lower rate of misclassification. Adjust the above code to compute the proportion of misclassified patients for the datasets containing the subsets of 10 and 15 best genes, respectively. Compare these proportions to the corresponding proportions of misclassifications obtained by using LDA. Does QDA perform better than LDA when we compare the error rates? Is the proportion of misclassifications high overall?

3. Trees: Singh Data. Return to point ?? (Trees: Singh Data section) in the practical assignment. Calculate the test error for the Singh data. Compare it with the training error. Is the test error large relative to the training error?