Computational Assignment 4: Discriminant Analysis


- Written by James Wilson
- Edited by Andrew Nobel

In this assignment, we will investigate running Fisher's discriminant analysis in R. This is a powerful classification tool that is easy to use with basic R commands.

1. Coding your own functions in R:

An important aspect of R is its flexibility in allowing the user to write functions. Until now, we have used built-in functions that are available in R; however, at some point all of these functions were manually written (perhaps by some poor student on a homework assignment). Before we begin, here are a few tips to consider when writing a function:

- Leave comments and documentation throughout your code. If you look at this function a year from now, it would be good to remember a) when it was written, b) what the function does, c) what the input and output variables are, and d) what the code is doing in each significant step. When documenting your code, try to make it readable for someone who has never used R.

- Make the function as easy to use as possible. The hope is that any code you write is one day publicly available and easy to use. If it's not easy to use, no one will use it.

- As you are writing the code, try out parts of the function bit by bit. This way, you can catch an error before the code gets too long.

Below, I provide code for Fisher's linear discriminant analysis on a Gaussian dataset as an illustrative example. We will walk you through how to extend this code to the case of quadratic discriminant analysis. All of these calculations come directly from the theory given in class. Once we have made our own function, we will see how to perform QDA using built-in R commands.

Write the following code to a new document (a .r or .txt file) and then copy and paste the function into your R console.

My.Gaussian.LDA = function(training, test){
  # INPUT:
  #   training = n.1 x (p+1) matrix whose first p columns are observed variables
  #              and whose (p+1)st column contains the class labels.
  #   test     = n.2 x p matrix of observed variables with no class labels.
  # OUTPUT:
  #   norm.vec = direction/normal vector associated with the projection of x
  #   offset   = offset of the projection of x onto norm.vec
  #   pred     = predicted class labels for the test set
  # NOTES:
  #   1) We assume that there are only 2 possible class labels.
  #   2) We assume the classes have equal covariance and that the data are Gaussian.
  #   3) The linear discriminant of the data is given by g(x) = x^t * norm.vec + offset.

  # Extract summary information
  n = dim(training)[1]                 # total number of samples
  p = dim(training)[2] - 1             # number of variables
  labels = unique(training[,p+1])      # the unique labels
  set.0 = which(training[,p+1] == labels[1])
  set.1 = which(training[,p+1] == labels[2])
  data.set.0 = training[set.0, 1:p]    # observed variables for set.0
  data.set.1 = training[set.1, 1:p]    # observed variables for set.1

  # Calculate MLEs
  pi.0.hat = (1/n) * length(set.0)
  pi.1.hat = (1/n) * length(set.1)
  mu.0.hat = colSums(data.set.0)/length(set.0)
  mu.1.hat = colSums(data.set.1)/length(set.1)
  sigma.hat = ((t(data.set.0) - mu.0.hat) %*% t(t(data.set.0) - mu.0.hat) +
               (t(data.set.1) - mu.1.hat) %*% t(t(data.set.1) - mu.1.hat)) / (n-2)

  # Coefficients of the linear discriminant
  b = log(pi.1.hat/pi.0.hat) - 0.5*(mu.0.hat + mu.1.hat) %*% solve(sigma.hat) %*% (mu.1.hat - mu.0.hat)
  a = solve(sigma.hat) %*% (mu.1.hat - mu.0.hat)

  # Prediction of the test set
  Pred = rep(0, dim(test)[1])
  g = as.matrix(test) %*% a + as.numeric(b)
  Pred[which(g > 0)] = 1

  # Return the coefficients and predictions
  return(list(norm.vec = a, offset = b, pred = Pred))
}

Notes:

- To begin a function, we use name = function(). In our example, we use the name My.Gaussian.LDA.

- INPUT arguments are placed within the parentheses of function( ). In our example, the function requires the INPUT arguments training and test.

- The main body of the function is placed between the braces {} at the beginning and end of the function.

- The return( ) command is required to provide OUTPUT. The OUTPUTs are placed in the parentheses of return( ). In our example, we return a list with three variables: norm.vec, offset, and pred. Importantly, without the return( ) command, there will be no OUTPUT variables when the function is used.

Be sure to read through the code and make sure that you understand each of the lines, as we have come across all of the commands used in it. In the next problem we will make use of this function.
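To make these conventions concrete before moving on, here is a small toy function that is separate from the assignment (its name, arguments, and defaults are purely illustrative). It follows the same pattern of documented INPUT/OUTPUT, a body inside braces, and a return( ) list:

My.Summary = function(data, digits = 2){
  # INPUT:  data   = a numeric matrix or data frame
  #         digits = number of decimal places to report
  # OUTPUT: means = vector of (rounded) column means
  #         sds   = vector of (rounded) column standard deviations
  means = round(colMeans(data), digits)
  sds = round(apply(data, 2, sd), digits)
  return(list(means = means, sds = sds))
}
# Example usage (on the built-in iris measurements):
# My.Summary(iris[,1:4])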

Questions:

(a) By using the above function, we assume that the covariance matrices of the variables of both classes of data are the same. Suppose that this was not the case. Write two lines of code that can be used in the above function to calculate the covariance matrices of each class, sigma.0 and sigma.1.

(b) Given sigma.0 and sigma.1 from (a), write out two lines of code that would calculate the normal vector and offset in the scenario that the sample covariances are not equal. Note that these results will correspond to quadratic discriminant analysis on Gaussian data.

(c) How might you adjust the above function's INPUT arguments to handle the case of unequal covariances? Hint: think about using a logical value for an argument named equal.covariance.matrices.

(d) (OPTIONAL) Note that once (c) has been completed, one can use the if() statement to handle the cases of equal or unequal covariance matrices. To do this, one can write if(equal.covariance.matrices == FALSE){} and if(equal.covariance.matrices == TRUE){} to handle the cases separately. Give this a try if you'd like. Once done, you will have written flexible code for linear or quadratic discriminant analysis on Gaussian data.

2. Visualizing Linear Discriminants:

Let's now see how our LDA function performs. Load the training and test data from the course website using the following commands:

training = read.table("<training data URL from the course website>", header = TRUE)
test = read.table("<test data URL from the course website>", header = TRUE)

The training data are simulated in the following manner:

- Labels are first chosen at random according to a fixed probability P(Label = 1).
- The first of the two variables (X1) is generated conditionally on the label:
    X1 | Label = 1 ~ N(0, 1)
    X1 | Label = 0 ~ N(3, 1)
- The second of the two variables (X2) is generated as X2 = X1 + N(0, 1).

The test data are also generated as Normal random variables with mean either 1 or 4 and standard deviation 1.
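If you would like to see the recipe above in code, the following sketch generates data of the same form. It is illustrative only: the value of P(Label = 1) is not stated above, so 0.5 is assumed here, the sample size of 100 is arbitrary, and the column names simply mirror those used in the plotting commands below (check them against the downloaded files).

# Illustrative simulation of training data in the form described above
set.seed(1)                                   # for reproducibility of the sketch only
n = 100                                       # assumed sample size
labels = rbinom(n, size = 1, prob = 0.5)      # assumed P(Label = 1) = 0.5
x.1 = ifelse(labels == 1, rnorm(n, 0, 1), rnorm(n, 3, 1))
x.2 = x.1 + rnorm(n, 0, 1)
sim.training = data.frame(x.1 = x.1, x.2 = x.2, labels = labels)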

Plot the training data to look for any noticeable structure by using the following commands:

# plot the two variables of the training data
plot(training$x.1, training$x.2, col = as.factor(training$labels), xlab = "x1", ylab = "x2", main = "Training Data")
# add a legend to the plot
legend("bottomright", c("0","1"), col = c("black","red"), pch = c(1,1))

Keep this plot up so that we can add a curve momentarily. Now, calculate the linear discriminant of this data using your function My.Gaussian.LDA from (1). Do this using the following code:

Results = My.Gaussian.LDA(training, test)

Type Results in your console to review the output of your function. Now, let's add the linear discriminant to your plot (which should still be visible). Add the line of the discriminant using the abline( ) command as in the following:

abline(a = -Results$offset/Results$norm.vec[2], b = -Results$norm.vec[1]/Results$norm.vec[2], col = "green")

(These abline( ) arguments come from setting the discriminant g(x) = x1*norm.vec[1] + x2*norm.vec[2] + offset equal to 0 and solving for x2, which gives intercept -offset/norm.vec[2] and slope -norm.vec[1]/norm.vec[2].)

Now, let's view the test data set and see how these points will be classified according to our discriminant rule. Make these plots using the following code:

plot(test$x.1, test$x.2, xlab = "x1", ylab = "x2", main = "Test Data")
# Add the discriminant line
abline(a = -Results$offset/Results$norm.vec[2], b = -Results$norm.vec[1]/Results$norm.vec[2], col = "green")

Questions:

(a) Comment on the discriminant and the training data. Are there any misclassifications on this data?

(b) Comment on the discriminant and the test data. How many test points are classified as 1? How many are classified as 0? Discuss any potential uncertainty based on your plot.

3. Quadratic Discriminant Analysis:

We can perform quadratic discriminant analysis in R simply by using the qda( ) command. It should be noted that we can also use the lda( ) command to run linear discriminant analysis. Both functions live in the MASS package, so load it first with library(MASS). Run quadratic discriminant analysis on this data using the following code:

library(MASS)
# Split the training data into variables and labels
training.variables = training[,1:2]
training.labels = training[,3]
# Run QDA
quad.disc = qda(training.variables, grouping = training.labels)

Now, the quad.disc variable contains summary information of the training data and can be used to predict the class of new data by using the predict( ) command. Predict the values of the test and training data using the following commands:

# Predict the test set
Prediction.test = predict(quad.disc, test)$class
# Predict the training set
Prediction.training = predict(quad.disc, training.variables)$class

Questions:

(a) Are there any misclassifications in the predictions for the training variables?

(b) Are there any differences in the predictions for the test variables between the quadratic discriminant rule here and the linear discriminant rule in Question (2)? Based on the plot in Question (2), which point do you think was classified differently?
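One possible way to look at the two questions above, assuming the objects from Questions (2) and (3) are still in your workspace (only names created earlier are used):

# Cross-tabulate the QDA training predictions against the true labels;
# off-diagonal counts are misclassifications.
table(Prediction.training, training.labels)

# Compare the QDA test predictions with the LDA predictions from Question (2).
# This assumes the 0/1 coding in Results$pred lines up with the class labels,
# which depends on the order of unique() inside My.Gaussian.LDA and should be checked.
which(as.character(Prediction.test) != as.character(Results$pred))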

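Since the section above notes that lda( ) is also available, an optional cross-check is to compare its predictions with those of My.Gaussian.LDA from Question (2). This is a sketch only; the object names lin.disc and lda.test.predict are illustrative, and the same caveat about the 0/1 coding in Results$pred applies.

lin.disc = lda(training.variables, grouping = training.labels)
lda.test.predict = predict(lin.disc, test)$class
# Agreement table between the built-in LDA predictions and our function's predictions
table(lda.test.predict, Results$pred)
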
4. Discriminant Analysis Application:

Let's try quadratic discriminant analysis on the iris dataset. We will see how well we can distinguish the setosa and virginica species based on the measured variables. First, pre-process the data using the following code:

# Load the data
data(iris)
# Keep only the setosa and virginica species
iris.sample = iris[which(iris$Species == "setosa" | iris$Species == "virginica"),]
# Keep a random sample of 50 of these for training
rand.sample = sample(1:100, 50, replace = FALSE)
training = iris.sample[rand.sample,]
# Separate the training set into labels and variables
training.labels = training$Species
training.variables = training[,1:4]
# Test set: the remaining 50 observations
test.sample = setdiff(1:100, rand.sample)
test = iris.sample[test.sample,]
# Split the test set into labels and variables
test.labels = test$Species
test.variables = test[,1:4]

Now, run quadratic discriminant analysis on the training data that you just created. Then, predict the labels of the test.variables. Use the following code:

quad.disc = qda(training.variables, grouping = as.numeric(training.labels))
test.predict = predict(quad.disc, test.variables)$class

Questions:

(a) Comment on the results of QDA on the iris dataset. Compare test.labels with test.predict. How many misclassifications were there on the test set?

(b) Would the My.Gaussian.LDA function that you created in Question (1) be appropriate for this dataset? Why or why not? If not, how could you adjust the function to be applicable to this dataset?
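For Question (a), one way to line the two vectors up, assuming the objects created above: test.labels keeps all three levels of iris$Species, while test.predict uses the numeric codes passed to qda( ) (1 for setosa and 3 for virginica, from the ordering of the Species factor).

# Cross-tabulate predictions against the true species (unused levels dropped)
table(test.predict, droplevels(test.labels))
# Count the misclassifications; this relies on the level/code correspondence noted above.
sum(as.numeric(as.character(test.predict)) != as.numeric(test.labels))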
