Computational Assignment 4: Discriminant Analysis




Written by James Wilson
Edited by Andrew Nobel

In this assignment, we will investigate running Fisher's discriminant analysis in R. This is a powerful classification tool that is easy to use with basic R commands.

1. Coding your own functions in R:

An important aspect of R is its flexibility in allowing the user to write functions. Until now, we have used built-in functions that are available in R; however, at some point all of these functions were manually written (perhaps by some poor student on a homework assignment). Before we begin, here are a few tips to consider when writing a function:

- Leave comments and documentation throughout your code. If you look at a function a year from now, it would be good to remember a) when it was written, b) what the function does, c) what the input and output variables are, and d) what the code is doing in each significant step. When documenting your code, try to make it readable for someone who has never used R.
- Make the function as easy to use as possible. The hope is that any code you write will one day be publicly available and easy to use. If it is not easy to use, no one will use it.
- As you write the code, test parts of the function bit by bit. This way, you can catch an error before the code gets too long.

Below, I provide code for Fisher's linear discriminant analysis on a Gaussian dataset as an illustrative example. We will walk you through how to extend this code to the case of quadratic discriminant analysis. All of these calculations come directly from the theory given in class. Once we have made our own function, we will see how to perform QDA using built-in R commands.

Write the following code in a new document (a .r or .txt file) and then copy and paste the function into your R console:

My.Gaussian.LDA = function(training, test){
  # INPUT:
  #   training = n.1 x (p+1) matrix whose first p columns are observed
  #              variables and whose (p+1)st column holds the class labels
  #   test     = n.2 x p matrix of observed variables with no class labels
  # OUTPUT:
  #   norm.vec = direction/normal vector associated with the projection of x
  #   offset   = offset of the projection of x onto norm.vec
  #   pred     = predicted class labels for the test set
  # NOTES:
  #   1) We assume that there are only 2 possible class labels.
  #   2) We assume the classes have equal covariance and that the data are Gaussian.
  #   3) The linear discriminant of the data is given by the function
  #      g(x) = x^T norm.vec + offset

  # Extract summary information
  n = dim(training)[1]                        # total number of samples
  p = dim(training)[2] - 1                    # number of variables
  labels = unique(training[,p+1])             # the unique labels
  set.0 = which(training[,p+1] == labels[1])
  set.1 = which(training[,p+1] == labels[2])
  data.set.0 = training[set.0,1:p]            # observed variables for set.0
  data.set.1 = training[set.1,1:p]            # observed variables for set.1

  # Calculate MLEs
  pi.0.hat = (1/n) * length(set.0)
  pi.1.hat = (1/n) * length(set.1)
  mu.0.hat = colSums(data.set.0)/length(set.0)
  mu.1.hat = colSums(data.set.1)/length(set.1)
  sigma.hat = ((t(data.set.0) - mu.0.hat) %*% t(t(data.set.0) - mu.0.hat) +
               (t(data.set.1) - mu.1.hat) %*% t(t(data.set.1) - mu.1.hat)) / (n-2)

  # Coefficients of the linear discriminant
  b = log(pi.1.hat/pi.0.hat) -
      0.5*(mu.0.hat + mu.1.hat) %*% solve(sigma.hat) %*% (mu.1.hat - mu.0.hat)
  a = solve(sigma.hat) %*% (mu.1.hat - mu.0.hat)

  # Prediction on the test set
  Pred = rep(0, dim(test)[1])
  g = as.matrix(test) %*% a + as.numeric(b)
  Pred[which(g > 0)] = 1

  # Return the coefficients and the predictions
  return(list(norm.vec = a, offset = b, pred = Pred))
}

Notes:

- To begin a function, we use name = function(). In our example, we use the name My.Gaussian.LDA.
- INPUT arguments are placed within the parentheses of function( ). In our example, the function requires the INPUT arguments training and test.
- The main body of the function is placed between the braces { } at the beginning and end of the function.
- The return( ) command is required to provide OUTPUT. The OUTPUTs are placed in the parentheses of return( ). In our example, we return a list with three variables: norm.vec, offset, and pred. Importantly, without the return( ) command, there will be no OUTPUT variables when the function is used.
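To see these conventions in isolation, here is a minimal toy function (hypothetical, not part of the assignment) that follows the same pattern: named INPUT arguments, a body in braces, and a return( ) wrapping the OUTPUT in a list.

My.Summary = function(x){
  # INPUT:  x = numeric vector
  # OUTPUT: mean and standard deviation of x, returned as a list
  m = mean(x)
  s = sd(x)
  return(list(mean = m, sd = s))
}

out = My.Summary(c(1, 2, 3, 4))
out$mean   # access one OUTPUT variable from the returned list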

Be sure to read through the code and make sure that you understand each of the lines, as we have come across all of the commands used in the code. In the next problem we will make use of this function.

Questions:

(a) By using the above function, we assume that the covariance matrices of the variables of both classes of data are the same. Suppose that this were not the case. Write two lines of code that could be used in the above function to calculate the covariance matrices of each class, sigma.0 and sigma.1.

(b) Given sigma.0 and sigma.1 from (a), write two lines of code that would calculate the normal vector and offset in the scenario where the sample covariances are not equal. Note that these results correspond to quadratic discriminant analysis on Gaussian data.

(c) How might you adjust the above function's INPUT arguments to handle the case of unequal covariances? Hint: think about using a logical value for an argument named equal.covariance.matrices.

(d) (OPTIONAL) Note that once (c) has been completed, one can use an if() statement to handle the cases of equal and unequal covariance matrices separately: if(equal.covariance.matrices == TRUE){ } for one case and if(equal.covariance.matrices == FALSE){ } for the other. Give this a try if you'd like. Once done, you will have written flexible code for linear or quadratic discriminant analysis on Gaussian data.

2. Visualizing Linear Discriminants:

Let's now see how our LDA function performs. Load the training and test data from the course website using the following commands:

training = read.table("http://www.unc.edu/%7ejameswd/data/training.txt", header = TRUE)
test = read.table("http://www.unc.edu/%7ejameswd/data/test.txt", header = TRUE)

The training data are simulated in the following manner:

- Labels are first chosen at random with probability P(Label = 1) = 0.5.
- The first of the two variables (X1) is generated conditionally on the label:
  X1 | Label = 1 ~ N(0, 1) and X1 | Label = 0 ~ N(3, 1).
- The second of the two variables (X2) is generated as X2 = X1 + N(0, 1).

The test data are also generated as normal random variables with mean either 1 or 4 and standard deviation 1.
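For intuition, the stated generating process can be reproduced with a few lines of R (a sketch only; the sample size and seed below are hypothetical, and the assignment itself uses the files from the course website):

# Simulate data with the same structure as the training set
set.seed(1)                                 # hypothetical seed
n = 100                                     # hypothetical sample size
labels = rbinom(n, 1, 0.5)                  # P(Label = 1) = 0.5
x.1 = rnorm(n, mean = ifelse(labels == 1, 0, 3), sd = 1)   # X1 | Label
x.2 = x.1 + rnorm(n, 0, 1)                  # X2 = X1 + N(0, 1)
sim.training = data.frame(x.1 = x.1, x.2 = x.2, labels = labels)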

Plot the training data to look for any noticeable structure by using the following commands:

# plot the two variables of the training data
plot(training$x.1, training$x.2, col = as.factor(training$labels),
     xlab = "x1", ylab = "x2", main = "Training Data")
# add a legend to the plot
legend("bottomright", c("0","1"), col = c("black","red"), pch = c(1,1))

Keep this plot up so that we can add a curve momentarily. Now, calculate the linear discriminant of this data using your function My.Gaussian.LDA from (1). Do this using the following code:

Results = My.Gaussian.LDA(training, test)

Type Results in your console to review the output of your function. Now, let's add the linear discriminant to your plot (which should still be visible). The decision boundary is the set of points where g(x) = 0; solving norm.vec[1]*x1 + norm.vec[2]*x2 + offset = 0 for x2 gives a line with intercept -offset/norm.vec[2] and slope -norm.vec[1]/norm.vec[2]. Add this line using the abline( ) command as in the following:

abline(a = -Results$offset/Results$norm.vec[2],
       b = -Results$norm.vec[1]/Results$norm.vec[2], col = "green")

Now, let's view the test data set and see how these points will be classified according to our discriminant rule. Make this plot using the following code:

plot(test$x.1, test$x.2, xlab = "x1", ylab = "x2", main = "Test Data")
# Add the discriminant line
abline(a = -Results$offset/Results$norm.vec[2],
       b = -Results$norm.vec[1]/Results$norm.vec[2], col = "green")

Questions:

(a) Comment on the discriminant and the training data. Are there any misclassifications on these data?

(b) Comment on the discriminant and the test data. How many test points are classified as 1? How many are classified as 0? Discuss any potential uncertainty based on your plot.

3. Quadratic Discriminant Analysis:

We can perform quadratic discriminant analysis in R by simply using the qda( ) command from the MASS package. It should be noted that we can also use the lda( ) command to run linear discriminant analysis. Run quadratic discriminant analysis on this data using the following code:

# qda( ) and lda( ) are provided by the MASS package
library(MASS)
# Split the training data into variables and labels
training.variables = training[,1:2]
training.labels = training[,3]
# Run QDA
quad.disc = qda(training.variables, grouping = training.labels)

Now, the quad.disc variable contains summary information about the training data and can be used to predict the class of new data using the predict( ) command. Predict the values of the test and training data using the following commands:

# Predict the test set
Prediction.test = predict(quad.disc, test)$class
# Predict the training set
Prediction.training = predict(quad.disc, training.variables)$class

Questions:

(a) Are there any misclassifications in the predictions for the training variables?

(b) Are there any differences between the predictions for the test variables under the quadratic discriminant rule here and under the linear discriminant rule in Question (2)? Based on the plot in Question (2), which point do you think was classified differently?
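A quick way to line up the two sets of test predictions for part (b) is a cross-tabulation (a sketch; it assumes the Results object from Question (2) is still in your workspace and that both rules code the classes as 0/1):

# Cross-tabulate the LDA predictions against the QDA predictions;
# off-diagonal counts are test points on which the two rules disagree
table(LDA = Results$pred, QDA = Prediction.test)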

4. Discriminant Analysis Application:

Let's try quadratic discriminant analysis on the iris dataset. We will see how well we can distinguish the setosa and virginica species based on the measured variables. First, pre-process the data using the following code:

# Load the data
data(iris)
# Keep only the setosa and virginica species
iris.sample = iris[which(iris$Species == "setosa" | iris$Species == "virginica"),]
# Keep a random sample of 50 of these for training
rand.sample = sample(1:100, 50, replace = FALSE)
training = iris.sample[rand.sample,]
# Separate the training set into labels and variables
training.labels = training$Species
training.variables = training[,1:4]
# Test set
test.sample = setdiff(1:100, rand.sample)
test = iris.sample[test.sample,]
# Split the variables and labels
test.labels = test$Species
test.variables = test[,1:4]

Now, run quadratic discriminant analysis on the training set that you just created. Then, predict the labels of test.variables. Use the following code:

quad.disc = qda(training.variables, grouping = as.numeric(training.labels))
test.predict = predict(quad.disc, test.variables)$class

Questions:

(a) Comment on the results of QDA on the iris dataset. Compare test.labels with test.predict. How many misclassifications were there on the test set?

(b) Would the My.Gaussian.LDA function that you created in Question (1) be appropriate for this dataset? Why or why not? If not, how could you adjust the function to be applicable to this dataset?
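For part (a), the comparison can be made concrete with a misclassification count and a confusion table (a sketch; it assumes the objects created above are in your workspace, and recodes test.labels with as.numeric( ) to match how the labels were passed to qda( )):

# Recode the true test labels the same way the training labels were coded
test.truth = as.numeric(test.labels)
# Count test-set misclassifications
sum(as.numeric(as.character(test.predict)) != test.truth)
# Break the errors out by class
table(truth = test.truth, predicted = test.predict)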