Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques. Load the workspace containing the R objects and functions for this assignment, PracticalObjects.RData.

Linear and Quadratic Discriminant Analysis: Cushing's Syndrome Data

In this section we consider the Cushing's syndrome data, which is available in the package MASS, and use the functions lda and qda to perform linear and quadratic discriminant analysis, respectively. These two functions are also available in the package MASS.

    library(MASS)
    Cushings

1. Three types of Cushing's syndrome are distinguished, coded a, b, and c. We remove the observations with type u (unknown) from the dataset.

    cush <- Cushings[Cushings$Type != "u", ]
    cush[,3] <- factor(cush[,3], levels=c("a","b","c"))
    cush.type <- cush[,3]

The function factor re-creates the type variable with only the levels a, b, and c. The vector cush.type indicates the syndrome type of each observation in the reduced dataset cush. Taking logarithms of the continuous variables makes their distributions more symmetrical; compare the boxplots before and after the transformation.

    boxplot(cush[,1:2])
    cush[,1] <- log(cush[,1])
    cush[,2] <- log(cush[,2])
    boxplot(cush[,1:2])

2. First, we perform a linear discriminant analysis on the Cushing's data. The function lda accepts the continuous variables as its first argument and the class labels as its second argument.

    cush.lda <- lda(cush[,1:2], cush.type)
    cush.lda

The command cush.lda displays some of the discriminant analysis results. Note that the linear discriminants, a_1 and a_2, are displayed.
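The linear discriminants mentioned above can also be read from the fitted object. The following is a minimal sketch, repeating the data preparation so that it runs on its own; the object names cush.demo and fit.demo are only illustrative. The coefficient vectors are stored in the component scaling, and the squared singular values in svd give the proportion of between-class separation captured by each discriminant.

```r
library(MASS)

# Prepare the data as in the text: drop type "u" and take logs
cush.demo <- Cushings[Cushings$Type != "u", ]
cush.demo$Type <- factor(cush.demo$Type, levels = c("a", "b", "c"))
cush.demo[, 1:2] <- log(cush.demo[, 1:2])

fit.demo <- lda(cush.demo[, 1:2], cush.demo$Type)

# Columns LD1 and LD2 hold the coefficients of the two linear discriminants
fit.demo$scaling

# Proportion of the between-class separation captured by each discriminant
prop.trace <- fit.demo$svd^2 / sum(fit.demo$svd^2)
prop.trace
```

With three classes and two variables there are at most two discriminants, which is why scaling has two columns here.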

We use the method predict to project observations onto the linear discriminants. The argument dimen specifies the number of discriminant components onto which the data will be projected.

    cush.pred <- predict(cush.lda, dimen=2)

After the object cush.pred is created we need to extract the scores (projected observations) on the first dimen discriminant variables.

    cush.pr <- cush.pred$x

Let us look at the data projected onto the first two discriminant components.

    eqscplot(cush.pr, type="n", xlab="first linear discriminant",
             ylab="second linear discriminant", main="lda")
    text(cush.pr, labels=as.character(cush.type), cex=0.8)

3. We can create a plot that shows the linear discriminant analysis decision boundaries for the dataset cush[,1:2]. The function dec.bound.plot, in the workspace PracticalObjects, is used to create such a plot. This function only accepts datasets with two continuous variables and one categorical (or nominal) variable that indicates the group or class of each observation. The first argument of this function is the object created by using lda (or qda), and the second and third arguments are the data (continuous variables) and the class labels, respectively.

    dec.bound.plot(cush.lda, cush[,1:2], cush.type, "Tetrahydrocortisone", "Pregnanetriol", "LDA")

Examining the plot, it is clear that the fourth, fifth, and sixth arguments of dec.bound.plot are the x axis label, the y axis label, and the title of the plot, respectively.

4. The function qda can be used to perform a quadratic discriminant analysis, and dec.bound.plot enables us to visualise the quadratic decision boundaries.

    cush.qda <- qda(cush[,1:2], cush.type)
    dec.bound.plot(cush.qda, cush[,1:2], cush.type, "Tetrahydrocortisone", "Pregnanetriol", "QDA")

5. The function visu.lqda, which is also in the workspace PracticalObjects, can be used to display the estimated Gaussian distributions for each class in addition to the linear or quadratic decision boundaries.
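The source of dec.bound.plot is not shown in this practical. A common way to draw decision regions like those above, sketched here under that caveat, is to classify every point of a fine grid and colour the grid by predicted class. All object names below are illustrative, and the data preparation is repeated so that the code runs without the workspace.

```r
library(MASS)

cush.demo <- Cushings[Cushings$Type != "u", ]
cush.demo$Type <- factor(cush.demo$Type, levels = c("a", "b", "c"))
cush.demo[, 1:2] <- log(cush.demo[, 1:2])

fit.demo <- lda(cush.demo[, 1:2], cush.demo$Type)

# Grid covering the observed range; column names must match the training data
grid.demo <- expand.grid(
  Tetrahydrocortisone = seq(min(cush.demo[, 1]), max(cush.demo[, 1]), length.out = 60),
  Pregnanetriol       = seq(min(cush.demo[, 2]), max(cush.demo[, 2]), length.out = 60)
)

# Predicted class at every grid point
grid.class <- predict(fit.demo, grid.demo)$class

# Colour the grid by predicted class and overlay the observations
plot(grid.demo[, 1], grid.demo[, 2], col = as.integer(grid.class) + 1, pch = ".",
     xlab = "log Tetrahydrocortisone", ylab = "log Pregnanetriol")
text(cush.demo[, 1], cush.demo[, 2], labels = as.character(cush.demo$Type))
```

Replacing lda by qda in the fit gives the corresponding quadratic regions; the boundaries are where the colour changes.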
For LDA the orientation and spread of the estimated densities for each class are similar, because LDA assumes a common covariance matrix for all classes.

    visu.lqda(cush.lda, cush)

Compare this to the spread and orientation of the estimated Gaussian distributions for QDA. Are there any differences?

    visu.lqda(cush.qda, cush)

Note that the second argument of the function visu.lqda is the full dataset (continuous variables and grouping factor).

6. Reduced-rank LDA does not use all projection directions. For the Cushing's data we can use a maximum of two projection directions. The function visu.rr1.lda can be used to fit a reduced-rank LDA with rank 1 to the cush dataset, and to visualise the decision boundaries.

    visu.rr1.lda(cush.lda, cush)

The solid black lines indicate the decision boundaries, while the coloured dashed lines indicate the projection of the data points onto one dimension.

Linear and Quadratic Discriminant Analysis: Vanveer Data

We consider the breast tumour data again and perform discriminant analyses on different subsets of the genes. For linear discriminant analysis we will use smaller subsets of the genes to ensure that the estimated covariance matrices are of full rank.

    vanv.10 <- vanveer.4000[,1:11]
    vanv.20 <- vanveer.4000[,1:21]
    vanv.prog <- vanveer.4000[,1]

1. In this section we will look more closely at the proportion of observations that are accurately classified when we use the LDA method. We perform an LDA on the dataset containing the subset of 10 best genes (columns 2 to 11; the first column holds the prognosis). As above we use the function predict to extract information about the projection of the data onto the linear discriminant components.

    vanv.lda.10 <- lda(vanv.10[,2:11], vanv.prog)
    vanv.pred.10 <- predict(vanv.lda.10)

Calculate the proportion of patients that are accurately classified. The function table counts the number of patients with prognosis poor that are classified as good or poor, and the number of patients with prognosis good that are classified as good or poor.

    table(vanv.prog, vanv.pred.10$class)

How many patients are misclassified? We can adjust the above commands to see how many patients are misclassified when we use the subset of 20 best genes.

    vanv.lda.20 <- lda(vanv.20[,2:21], vanv.prog)
    vanv.pred.20 <- predict(vanv.lda.20)
    table(vanv.prog, vanv.pred.20$class)

2. Note that we used the same data to fit the classifier and to assess its classification performance. A more accurate estimate of the misclassification rate can be obtained by dividing the dataset into a training set and a test set.
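The counts in such a table translate directly into a misclassification proportion. Here is a minimal sketch on the Cushing's data, used because the Vanveer objects live in the workspace; as above, the fit is assessed on the same data it was trained on. The object names are illustrative.

```r
library(MASS)

cush.demo <- Cushings[Cushings$Type != "u", ]
cush.demo$Type <- factor(cush.demo$Type, levels = c("a", "b", "c"))
cush.demo[, 1:2] <- log(cush.demo[, 1:2])

fit.demo  <- lda(cush.demo[, 1:2], cush.demo$Type)
pred.demo <- predict(fit.demo, cush.demo[, 1:2])

# Rows: true class; columns: predicted class
tab <- table(truth = cush.demo$Type, predicted = pred.demo$class)
tab

# Correct classifications lie on the diagonal of the table
n.misclassified <- sum(tab) - sum(diag(tab))
prop.correct    <- sum(diag(tab)) / sum(tab)
c(misclassified = n.misclassified, proportion.correct = prop.correct)
```

The diagonal shortcut relies on both factors having the same levels in the same order, so that the table is square.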
We randomly select patients for the training set (roughly 60% of the patients), and the remaining patients constitute the test set.

    train <- runif(nrow(vanv.10)) < 0.60
    test <- !train

The vectors train and test are indicator vectors: they indicate which observations in the dataset vanv.10 belong to the training set, and which belong to the test set.
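The way such indicator vectors are used can be tried on simulated data. The sketch below is illustrative only: the data, the group labels, and all object names are made up. The call to set.seed is an addition so that the random split is reproducible.

```r
library(MASS)

set.seed(42)  # makes the random split and the simulated data reproducible

# Simulated two-class data: 200 observations, 5 features,
# with the first two features shifted for the second class
n   <- 200
grp <- factor(rep(c("good", "poor"), each = n / 2))
x   <- matrix(rnorm(n * 5), n, 5)
x[grp == "poor", 1:2] <- x[grp == "poor", 1:2] + 1.5

# Logical indicator vectors: TRUE marks membership of the respective set
train <- runif(n) < 0.60
test  <- !train

# Fit on the training rows only, predict the held-out rows
fit.demo  <- lda(x[train, ], grp[train])
pred.demo <- predict(fit.demo, x[test, ])

tab <- table(truth = grp[test], predicted = pred.demo$class)
test.error <- 1 - sum(diag(tab)) / sum(tab)
test.error
```

Because the split is drawn with runif, the training set contains roughly, not exactly, 60% of the observations, and a different seed gives a different estimate.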

We fit the linear discriminant analysis on the data in the training set only, and then use the fitted discriminants to predict the classes of the observations in the test set. To calculate the number of misclassifications in the test set we compare the predicted classes of the observations in the test set with their true classes.

    vanv.lda.10 <- lda(vanv.10[train,2:11], vanv.prog[train])
    vanv.pred.10 <- predict(vanv.lda.10, vanv.10[test,2:11])
    table(vanv.prog[test], vanv.pred.10$class)

How many patients in the test set are misclassified, for the subset of 10 best genes?

Decision Trees: Cushing's Syndrome Data

In this section we look at the Cushing's syndrome data in the package MASS.

1. First we grow a full decision tree, and examine the splits that are carried out. The function rpart in the package rpart is used to grow a tree. First we specify the arguments formula and data. The argument formula requires that we indicate the grouping factor on the left hand side of the ~ and the other variables, which determine the grouping, on the right hand side. The argument data is used to indicate the data frame which contains the variables named in the formula. Note how we specify the formula: Type~. Here Type is the name of the variable that contains the grouping/class label of each observation, and the dot (.) stands for all the other variables in the data frame. Two control parameters are specified as well: minsplit and cp. The former indicates the minimum number of observations that must exist in a node in order for a split to be attempted, while the complexity parameter cp specifies that any split that does not decrease the overall lack of fit by a factor of cp should not be attempted. (Use the command ?rpart.control to look at the various parameters that control aspects of the rpart fit.)

    library(rpart)
    cush.tree <- rpart(Type~., data=cush, cp=0, minsplit=3)
    summary(cush.tree)
    plot(cush.tree)
    text(cush.tree)

The function summary can be used to display details of the fitted tree.
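The effect of the two control parameters can be seen by comparing the tree grown with the rpart defaults to the full tree grown above. This is a minimal sketch with illustrative object names; n.leaves is a small helper defined here, not an rpart function, and the data preparation is repeated so that the code runs on its own.

```r
library(MASS)
library(rpart)

cush.demo <- Cushings[Cushings$Type != "u", ]
cush.demo$Type <- factor(cush.demo$Type, levels = c("a", "b", "c"))
cush.demo[, 1:2] <- log(cush.demo[, 1:2])

# Defaults (see ?rpart.control): minsplit = 20, cp = 0.01
tree.default <- rpart(Type ~ ., data = cush.demo)

# Full tree: every split that improves the fit at all is kept
tree.full <- rpart(Type ~ ., data = cush.demo, cp = 0, minsplit = 3)

# Helper (not part of rpart): terminal nodes are marked "<leaf>" in the frame
n.leaves <- function(tree) sum(tree$frame$var == "<leaf>")

n.default <- n.leaves(tree.default)
n.full    <- n.leaves(tree.full)
c(default = n.default, full = n.full)
```

With only 21 observations the default minsplit of 20 allows at most one split, which is why minsplit is lowered in the text.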
The function plot.partition in the workspace PracticalObjects is used to visualise the partitioning of the Cushing's data in two dimensions. Note that this function can only be used to visualise datasets with two continuous variables (which explain the grouping).

    plot.partition(cush.tree, cush)

2. To prune the decision tree we can use cross-validation to pick a subtree. The function plotcp gives a visual representation of the cross-validation results of an rpart object. The vertical lines show the standard errors of the respective subtrees, while the horizontal line lies one standard error above the cross-validated error of the best subtree. A good choice of the size of the subtree is often the leftmost value for which the mean lies below the horizontal line.

    plotcp(cush.tree)

Accordingly, we choose the subtree of size three, with a cp value of 0.15, for the Cushing's dataset. The function prune is used to prune the full decision tree to a subtree of a specific size. We have to specify the value of cp when we use the function prune to ensure that the tree is pruned to the desired extent.

    cush.prune <- prune(cush.tree, cp=0.15)
    plot(cush.prune)
    text(cush.prune)

Look at the partitioning of the feature space that corresponds to this decision tree.

    plot.partition(cush.prune, cush)

How many data points are misclassified?

Decision Trees: Singh Data

In this section we use the prostate cancer data of Singh et al. (2002), discussed in Example 2. We examine the performance of decision trees on gene expression data.

1. The full decision tree. The Singh dataset consists of a training set and a test set, singh.train and singh.test, respectively. These datasets are stored in the workspace PracticalObjects. Let us examine the Singh dataset. First, we can look at the names of the variables in the dataset by using the function dimnames. Second, it is also of interest to find the dimensions of each dataset. For this purpose we use the functions nrow and ncol, which give the number of rows and columns of the dataset, respectively.

    dimnames(singh.train)[[2]]
    nrow(singh.train)
    ncol(singh.train)
    nrow(singh.test)
    ncol(singh.test)

We note that the first variable in the Singh dataset, outcome, indicates whether a sample is a tumour or a non-tumour sample. How are these two classes indicated?

    levels(singh.train[,1])

A full tree is grown for the training data, and plotted.

    singh.tree <- rpart(outcome~., data=singh.train, cp=0)
    plot(singh.tree)
    text(singh.tree)

How many genes are used to classify the data?
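For a larger tree it can be easier to read the splitting variables from the fitted object than from the plot. The sketch below uses the kyphosis data shipped with rpart as a stand-in for singh.train, an assumption made so that the code runs without the workspace. The object names are illustrative.

```r
library(rpart)

# Stand-in data: kyphosis ships with rpart (response Kyphosis, three predictors)
tree.demo <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0)

# Each row of frame is a node; var names the splitting variable,
# with "<leaf>" marking terminal nodes
split.vars <- tree.demo$frame$var[tree.demo$frame$var != "<leaf>"]

# Distinct variables actually used in the splits
vars.used <- unique(as.character(split.vars))
vars.used
length(vars.used)
```

Applied to singh.tree, the same two lines would list the genes on which the tree splits.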

2. The performance of the tree is evaluated by calculating the training error and the test error. To determine the classification of each observation in the training set we use the function predict.

    singh.prob.pred <- predict(singh.tree, singh.train)
    singh.prob.pred

If the posterior probability for the class label normal is larger than 0.5, then an observation is classified as normal, and as tumour otherwise. Let us create a vector that indicates the classification of each observation in the training dataset. The function ifelse can be used to create such a vector.

    singh.class <- ifelse(singh.prob.pred[,1]>0.5, "normal", "tumour")
    singh.class

Read more about this function (?ifelse). Now we can compute the training error by comparing the vector singh.train$outcome (which contains the class labels for the training set) with the vector singh.class (containing the predicted classes). The training error is the number of misclassified observations divided by the total number of observations in the training set.

    singh.train.error <- sum(singh.class!=singh.train$outcome) / nrow(singh.train)

Make sure you understand how we calculated the training error by breaking the above command down into smaller parts:

    misclass.vector <- singh.class!=singh.train$outcome
    number.misclass <- sum(misclass.vector)
    number.obs.train.set <- nrow(singh.train)
    singh.train.error <- number.misclass / number.obs.train.set
    misclass.vector
    number.misclass
    singh.train.error

The test error can be calculated by predicting the classes of the test set first.

    singh.prob.pred <- predict(singh.tree, singh.test)

Exercises

1. LDA and QDA. Adjust the commands given in point ?? (LDA and QDA: Vanveer Data section) and calculate the proportion of misclassifications in the test set for the subsets of 15 and 20 best genes. Compare these proportions to the proportion of misclassifications calculated at point ?? when we used the whole dataset to perform an LDA and to assess its performance. Do you notice any differences?

2. We can use the same training set and test set as defined above (point ??) and assess the performance of quadratic discriminant analysis. The quadratic decision boundaries are more flexible than the linear decision boundaries (as seen in the plots constructed for the Cushing's data). This can create the expectation that QDA will lead to a lower rate of misclassification. Adjust the above code to compute the proportion of misclassified patients for the datasets containing the subsets of 10 and 15 best genes, respectively. Compare these proportions to the corresponding proportions of misclassifications obtained by using LDA. Does QDA perform better than LDA, when we compare the error rates? Is the proportion of misclassifications high overall?

3. Trees: Singh Data. Return to point ?? (Trees: Singh Data section) in the practical assignment. Calculate the test error for the Singh data. Compare this with the training error. Is the test error large relative to the training error?
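As a pointer for the tree error calculations, predict for rpart objects can also return class labels directly via type = "class", which avoids thresholding the probabilities by hand. The sketch below is not a solution for the Singh data. It uses the kyphosis data shipped with rpart and a random split, both assumptions made so that the code runs without the workspace, and the object names are illustrative.

```r
library(rpart)

set.seed(1)  # reproducible random split

# Stand-in data with a roughly 60/40 split into training and test rows
train <- runif(nrow(kyphosis)) < 0.60
test  <- !train

tree.demo <- rpart(Kyphosis ~ ., data = kyphosis[train, ], cp = 0)

# type = "class" returns the predicted class labels as a factor
pred.train <- predict(tree.demo, kyphosis[train, ], type = "class")
pred.test  <- predict(tree.demo, kyphosis[test, ],  type = "class")

# Error rate: proportion of predictions that differ from the true label
train.error <- mean(pred.train != kyphosis$Kyphosis[train])
test.error  <- mean(pred.test  != kyphosis$Kyphosis[test])
c(train = train.error, test = test.error)
```

The mean of a logical vector is the proportion of TRUE values, so this is the same calculation as dividing the number of misclassified observations by the number of observations.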