Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees


In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques. Load the workspace containing the R objects and functions for this assignment, PracticalObjects.RData.

Linear and Quadratic Discriminant Analysis: Cushing's Syndrome Data

In this section we consider the Cushing's syndrome data, which is available in the package MASS, and use the functions lda and qda to perform linear and quadratic discriminant analysis, respectively. These two functions are also available in the package MASS.

    library(MASS)
    Cushings

1. Three types of Cushing's syndrome are distinguished, coded a, b, and c. We remove the observations with type u (unknown) from the dataset.

    cush <- Cushings[Cushings$Type!="u",]
    cush[,3] <- factor(cush[,3],levels=c("a","b","c"))
    cush.type <- cush[,3]

The function factor is used to re-create the type variable with only the levels a, b, and c. The vector cush.type then indicates the syndrome type of each observation in the reduced dataset cush. Taking logs of the continuous variables makes their distributions more symmetrical; compare the boxplots before and after the transformation.

    boxplot(cush[,1:2])
    cush[,1] <- log(cush[,1])
    cush[,2] <- log(cush[,2])
    boxplot(cush[,1:2])

2. First, we perform a linear discriminant analysis on the Cushing's data. The function lda accepts the continuous variables as its first argument and the class labels as its second argument.

    cush.lda <- lda(cush[,1:2], cush.type)
    cush.lda

Typing the name of the object, cush.lda, displays some of the discriminant analysis results. Note that the coefficients of the linear discriminants, a1 and a2, are displayed (labelled LD1 and LD2 in the output).
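The printed summary is assembled from components stored in the fitted object. As a minimal sketch (it repeats the preparation steps above so that it runs on its own, and the object names cush.demo and fit.demo are introduced here for illustration only), the prior probabilities, class means, and discriminant coefficients can be extracted directly:

```r
library(MASS)

# Repeat the data preparation so the sketch is self-contained
cush.demo <- Cushings[Cushings$Type != "u", ]
cush.demo$Type <- factor(cush.demo$Type, levels = c("a", "b", "c"))
cush.demo[, 1:2] <- log(cush.demo[, 1:2])

fit.demo <- lda(cush.demo[, 1:2], cush.demo$Type)

fit.demo$prior    # estimated prior probabilities (class proportions)
fit.demo$means    # class means of the log-transformed variables
fit.demo$scaling  # coefficients of the linear discriminants (columns LD1, LD2)
```

With three classes and two variables there are at most two discriminants, which is why the scaling matrix has two columns.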

We use the method predict to project observations onto the linear discriminants. The argument dimen specifies the number of discriminant components onto which the data will be projected.

    cush.pred <- predict(cush.lda, dimen=2)

After the object cush.pred is created we need to extract the scores (projected observations) on the first dimen discriminant variables.

    cush.pr <- cush.pred$x

Let us look at the data projected onto the first two discriminant components.

    eqscplot(cush.pr, type="n", xlab="first linear discriminant",
             ylab="second linear discriminant", main="lda")
    text(cush.pr, labels=as.character(cush.type), cex=0.8)

3. We can create a plot that indicates the linear discriminant analysis decision boundaries of the dataset cush[,1:2]. The function dec.bound.plot, in the workspace PracticalObjects, is used to create such a plot. This function only accepts datasets with two continuous variables and one categorical (or nominal) variable that indicates the group or class of each observation. The first argument of this function is the object created by using lda (or qda), and the second and third arguments are the data (continuous variables) and the class labels, respectively.

    dec.bound.plot(cush.lda, cush[,1:2], cush.type,
                   "Tetrahydrocortisone", "Pregnanetriol", "LDA")

Examining the plot, it is clear that the fourth, fifth, and sixth arguments of dec.bound.plot are the x axis label, the y axis label, and the title of the plot, respectively.

4. The function qda can be used to perform a quadratic discriminant analysis, and dec.bound.plot enables us to visualise the quadratic decision boundaries.

    cush.qda <- qda(cush[,1:2], cush.type)
    dec.bound.plot(cush.qda, cush[,1:2], cush.type,
                   "Tetrahydrocortisone", "Pregnanetriol", "QDA")

5. The function visu.lqda, which is also in the workspace PracticalObjects, can be used to display the estimated Gaussian distributions for each class in addition to the linear or quadratic decision boundaries.
For LDA the estimated densities of all classes have the same orientation and spread, because LDA assumes a common covariance matrix.

    visu.lqda(cush.lda, cush)

Compare this to the spread and orientation of the estimated Gaussian distributions for QDA. Are there any differences?

    visu.lqda(cush.qda, cush)

Note that the second argument of the function visu.lqda is the full dataset (continuous variables and grouping factor).

6. The reduced rank LDA does not use all projection directions. For the Cushing's data we can use a maximum of two projection directions. The function visu.rr1.lda can be used to fit a reduced rank LDA with rank 1 to the cush dataset, and to visualise the decision boundaries.

    visu.rr1.lda(cush.lda, cush)

The solid black lines indicate the decision boundaries, while the coloured dashed lines indicate the projection of the data points onto one dimension.

Linear and Quadratic Discriminant Analysis: Vanveer Data

We consider the breast tumour data again and perform discriminant analyses on different subsets of the genes. For linear discriminant analysis we will use smaller subsets of the genes to ensure that the estimated covariance matrices are of full rank. The first column of vanveer.4000 contains the prognosis; the remaining columns contain the genes.

    vanv.10 <- vanveer.4000[,1:11]
    vanv.20 <- vanveer.4000[,1:21]
    vanv.prog <- vanveer.4000[,1]

1. In this section we will look more closely at the proportion of observations that are accurately classified when we use the LDA method. We perform an LDA on the dataset containing the subset of 10 best genes. As above we use the function predict to extract information about the projection of the data onto the linear discriminant components.

    vanv.lda.10 <- lda(vanv.10[,2:11], vanv.prog)
    vanv.pred.10 <- predict(vanv.lda.10)

Calculate the proportion of patients that are accurately classified. The function table counts the number of patients with prognosis poor that are classified as good or poor, and the number of patients with prognosis good that are classified as good or poor.

    table(vanv.prog, vanv.pred.10$class)

How many patients are misclassified? We can adjust the above commands to see how many patients are misclassified when we use the subset of 20 best genes.

    vanv.lda.20 <- lda(vanv.20[,2:21], vanv.prog)
    vanv.pred.20 <- predict(vanv.lda.20)
    table(vanv.prog, vanv.pred.20$class)

2. Note that we used the same data to fit the model and to assess its classification performance. A more accurate estimate of the misclassification rate can be obtained by dividing the dataset into a training set and a test set.

We randomly select patients for the training set (roughly 60% of the patients), and the remaining patients constitute the test set.

    train <- runif(nrow(vanv.10)) < 0.60
    test <- !train

The vectors train and test are indicator vectors: they indicate which observations in the dataset vanv.10 belong to the training set, and which belong to the test set.
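Because runif draws fresh random numbers on every call, the split changes from run to run. A minimal sketch of a reproducible split, assuming a hypothetical sample size n.demo in place of nrow(vanv.10) and an arbitrary seed (neither is part of the assignment):

```r
set.seed(1)                         # arbitrary seed, chosen only for reproducibility
n.demo <- 100                       # hypothetical number of patients

train.demo <- runif(n.demo) < 0.60  # TRUE for observations in the training set
test.demo  <- !train.demo           # TRUE for observations in the test set

sum(train.demo)                     # size of the training set
sum(test.demo)                      # size of the test set
mean(train.demo)                    # realised training fraction
```

Each observation is assigned independently, so the training set contains about, not exactly, 60% of the observations.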

We perform a linear discriminant analysis on the data in the training set only. We then use the fitted model to predict the classes of the observations in the test set. To calculate the number of misclassifications in the test set we compare the predicted classes of the observations in the test set with their true classes.

    vanv.lda.10 <- lda(vanv.10[train,2:11], vanv.prog[train])
    vanv.pred.10 <- predict(vanv.lda.10, vanv.10[test,2:11])
    table(vanv.prog[test], vanv.pred.10$class)

How many patients in the test set are misclassified, for the subset of 10 best genes?

Decision Trees: Cushing's Syndrome Data

In this section we look at the Cushing's syndrome data in the library MASS.

1. First we grow a full decision tree, and examine the splits that are carried out. The function rpart in the package rpart is used to grow a tree. First we specify the arguments formula and data. The argument formula requires that we indicate the grouping factor on the left hand side of the ~ and the other variables, which determine the grouping, on the right hand side. The argument data is used to indicate the data frame which contains the variables named in the formula. Note how we specify the formula: Type ~ . Here Type is the name of the variable that contains the grouping/class label of each observation, and the . indicates all the other variables in the data frame. Two control parameters are specified as well: minsplit and cp. The former indicates the minimum number of observations that must exist in a node in order for a split to be attempted, while the complexity parameter cp specifies that any split that does not decrease the overall lack of fit by a factor of cp should not be attempted. (Use the command ?rpart.control to look at the various parameters that control aspects of the rpart fit.)

    library(rpart)
    cush.tree <- rpart(Type~., data=cush, cp=0, minsplit=3)
    summary(cush.tree)
    plot(cush.tree)
    text(cush.tree)

The function summary can be used to display details of the fitted tree.
The function plot.partition in the workspace PracticalObjects is used to visualise the partitioning of the Cushing's data in two dimensions. Note that this function can only be used to visualise datasets with two continuous variables (which explain the grouping).

    plot.partition(cush.tree, cush)

2. To prune the decision tree we can use cross-validation to pick a subtree. The function plotcp gives a visual representation of the cross-validation results of an rpart object. The vertical lines show the standard errors of the respective subtrees, while the horizontal line is one standard error worse than the best subtree. A good choice of the size of the subtree is often the leftmost value for which the mean lies below the horizontal line.

    plotcp(cush.tree)

Accordingly, we choose the subtree of size three, with a cp value of 0.15, for the Cushing's dataset. The function prune is used to prune the full decision tree to a subtree of a specific size. We have to specify the value of cp when we use the function prune to ensure that the tree is pruned to the desired extent.

    cush.prune <- prune(cush.tree, cp=0.15)
    plot(cush.prune)
    text(cush.prune)

Look at the partitioning of the feature space that corresponds to this decision tree.

    plot.partition(cush.prune, cush)

How many data points are misclassified?

Decision Trees: Singh Data

In this section we use the prostate cancer data of Singh et al. (2002), discussed in Example 2. We examine the performance of decision trees on gene expression data.

1. The full decision tree. The Singh dataset consists of a training set and a test set, singh.train and singh.test, respectively. These datasets are stored in the workspace PracticalObjects. Let us examine the Singh dataset. First, we can look at the names of the variables in the dataset by using the function dimnames. Second, it is also of interest to find the dimensions of each dataset. For this purpose we use the functions nrow and ncol, which give the number of rows and columns of the dataset, respectively.

    dimnames(singh.train)[[2]]
    nrow(singh.train)
    ncol(singh.train)
    nrow(singh.test)
    ncol(singh.test)

We note that the first variable in the Singh dataset, outcome, indicates whether it is a tumour or non-tumour sample. How are these two classes indicated?

    levels(singh.train[,1])

A full tree is grown for the training data, and plotted.

    singh.tree <- rpart(outcome~., data=singh.train, cp=0)
    plot(singh.tree)
    text(singh.tree)

How many genes are used to classify the data?
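An rpart object records the splitting variable of every node in its frame component, which gives one way to count the variables a tree actually uses without reading them off the plot. The sketch below is illustrative only: it uses the kyphosis data shipped with rpart rather than the Singh data, and the names kyph.tree, node.vars, and used.vars are introduced here.

```r
library(rpart)

# Grow a full tree on the kyphosis data that ships with rpart
kyph.tree <- rpart(Kyphosis ~ ., data = kyphosis, cp = 0)

# frame$var holds the splitting variable of each node; leaves are marked "<leaf>"
node.vars <- as.character(kyph.tree$frame$var)
used.vars <- unique(node.vars[node.vars != "<leaf>"])

used.vars          # names of the variables used in at least one split
length(used.vars)  # number of distinct variables used
```

The same two lines applied to a tree grown on gene expression data would list the genes that appear in at least one split.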

2. The performance of the tree is evaluated by calculating the training error and test error. To determine the classification of each observation in the training set we use the function predict.

    singh.prob.pred <- predict(singh.tree, singh.train)
    singh.prob.pred

If the posterior probability for the class label normal is larger than 0.5, then an observation is classified as normal, and as tumour otherwise. Let us create a vector that indicates the classification of each observation in the training dataset. The function ifelse can be used to create such a vector.

    singh.class <- ifelse(singh.prob.pred[,1]>0.5, "normal", "tumour")
    singh.class

Read more about this function (?ifelse). Now we can compute the training error by comparing the vector singh.train$outcome (which contains the class labels for the training set) with the vector singh.class (containing the predicted classes). The training error is the number of misclassified observations divided by the total number of observations in the training set.

    singh.train.error <- sum(singh.class!=singh.train$outcome)/nrow(singh.train)

Make sure you understand how we calculated the training error by breaking the above command down into smaller parts:

    misclass.vector <- singh.class!=singh.train$outcome
    number.misclass <- sum(misclass.vector)
    number.obs.train.set <- nrow(singh.train)
    singh.train.error <- number.misclass / number.obs.train.set
    misclass.vector
    number.misclass
    singh.train.error

The test error can be calculated by predicting the classes of the test set first.

    singh.prob.pred <- predict(singh.tree, singh.test)

Exercises

1. LDA and QDA. Adjust the commands given in point ?? (LDA and QDA: Vanveer Data section) and calculate the proportion of misclassifications in the test set for the subsets of 15 and 20 best genes. Compare these proportions to the proportion of misclassifications calculated at point ?? when we used the whole dataset to perform an LDA and to assess its performance. Do you notice any differences?

2. We can use the same training set and test set as defined above (point ??) and assess the performance of quadratic discriminant analysis. The quadratic decision boundaries are more flexible than the linear decision boundaries (as seen in the plots constructed for the Cushing's data). This can create the expectation that QDA will lead to a lower rate of misclassification. Adjust the above code to compute the proportion of misclassified patients for the datasets containing the subsets of 10 and 15 best genes, respectively. Compare these proportions to the corresponding proportions of misclassifications obtained by using LDA. Does QDA perform better than LDA, when we compare the error rates? Is the proportion of misclassifications high overall?

3. Trees: Singh Data. Return to point ?? (Trees: Singh Data section) in the practical assignment. Calculate the test error for the Singh data. Compare this with the training error. Is the test error large relative to the training error?
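The error calculations in this practical all follow the same pattern: predict, compare with the true labels, and take the proportion of mismatches. As a hedged sketch on the Cushing's data (not the Singh or Vanveer data), the pattern can be wrapped in a small helper. The helper misclass.rate and the objects cush.demo, tree.demo, and pred.demo are hypothetical names introduced here; the sketch also uses predict with type = "class", which returns predicted labels directly instead of posterior probabilities.

```r
library(MASS)
library(rpart)

# Hypothetical helper: proportion of predictions that differ from the truth
misclass.rate <- function(truth, predicted) {
  mean(as.character(truth) != as.character(predicted))
}

# Prepare the Cushing's data as in the first section
cush.demo <- Cushings[Cushings$Type != "u", ]
cush.demo$Type <- factor(cush.demo$Type, levels = c("a", "b", "c"))
cush.demo[, 1:2] <- log(cush.demo[, 1:2])

tree.demo <- rpart(Type ~ ., data = cush.demo, cp = 0, minsplit = 3)

# type = "class" returns the predicted class label of each observation
pred.demo <- predict(tree.demo, cush.demo, type = "class")

table(cush.demo$Type, pred.demo)          # confusion table (rows: true classes)
misclass.rate(cush.demo$Type, pred.demo)  # training error of the full tree
```

The same helper applies to a test set by passing held-out data to predict; the value computed here is a training error and is therefore an optimistic estimate.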


More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Analysis Tools and Libraries for BigData

Analysis Tools and Libraries for BigData + Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I

More information

Lecture 10: Regression Trees

Lecture 10: Regression Trees Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,

More information

Big Data Decision Trees with R

Big Data Decision Trees with R REVOLUTION ANALYTICS WHITE PAPER Big Data Decision Trees with R By Richard Calaway, Lee Edlefsen, and Lixin Gong Fast, Scalable, Distributable Decision Trees Revolution Analytics RevoScaleR package provides

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection

More information

COC131 Data Mining - Clustering

COC131 Data Mining - Clustering COC131 Data Mining - Clustering Martin D. Sykora m.d.sykora@lboro.ac.uk Tutorial 05, Friday 20th March 2009 1. Fire up Weka (Waikako Environment for Knowledge Analysis) software, launch the explorer window

More information

Data Mining Lab 5: Introduction to Neural Networks

Data Mining Lab 5: Introduction to Neural Networks Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Getting Started with R and RStudio 1

Getting Started with R and RStudio 1 Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Linear Discriminant Analysis

Linear Discriminant Analysis Fiche TD avec le logiciel : course5 Linear Discriminant Analysis A.B. Dufour Contents 1 Fisher s iris dataset 2 2 The principle 5 2.1 Linking one variable and a factor.................. 5 2.2 Linking a

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Classification and Regression by randomforest

Classification and Regression by randomforest Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

More information

Data Analysis of Trends in iphone 5 Sales on ebay

Data Analysis of Trends in iphone 5 Sales on ebay Data Analysis of Trends in iphone 5 Sales on ebay By Wenyu Zhang Mentor: Professor David Aldous Contents Pg 1. Introduction 3 2. Data and Analysis 4 2.1 Description of Data 4 2.2 Retrieval of Data 5 2.3

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

TOWARD BIG DATA ANALYSIS WORKSHOP

TOWARD BIG DATA ANALYSIS WORKSHOP TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)

More information

IBM SPSS Decision Trees 21

IBM SPSS Decision Trees 21 IBM SPSS Decision Trees 21 Note: Before using this information and the product it supports, read the general information under Notices on p. 104. This edition applies to IBM SPSS Statistics 21 and to all

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

Experiments in Web Page Classification for Semantic Web

Experiments in Web Page Classification for Semantic Web Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address

More information

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)

More information

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d. EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models

More information

College Tuition: Data mining and analysis

College Tuition: Data mining and analysis CS105 College Tuition: Data mining and analysis By Jeanette Chu & Khiem Tran 4/28/2010 Introduction College tuition issues are steadily increasing every year. According to the college pricing trends report

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Implementation of Breiman s Random Forest Machine Learning Algorithm

Implementation of Breiman s Random Forest Machine Learning Algorithm Implementation of Breiman s Random Forest Machine Learning Algorithm Frederick Livingston Abstract This research provides tools for exploring Breiman s Random Forest algorithm. This paper will focus on

More information

Business Analytics and Credit Scoring

Business Analytics and Credit Scoring Study Unit 5 Business Analytics and Credit Scoring ANL 309 Business Analytics Applications Introduction Process of credit scoring The role of business analytics in credit scoring Methods of logistic regression

More information