Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees


 Tyrone Marshall
 2 years ago
 Views:
Transcription
1 Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and treebased classification techniques. Load the workspace containing the R objects and functions for this assignment, PracticalObjects.RData. Linear and Quadratic Discriminant Analysis: Cushing s Syndrome Data In this section we consider the Cushing s syndrome data, which is available in the package MASS, and use the functions lda and qda to perform linear and quadratic discriminant analysis, respectively. These two functions are also available in the package MASS. library(mass) Cushings 1. Note that there are three types of Cushing s syndrome are distinguished, coded a, b, and c. We remove the observations from the dataset with type u (unknown). cush < Cushings[Cushings$Type!="u",] cush[,3] < factor(cush[,3],levels=c("a","b","c")) cush.type<cush[,3] The function factor is used to create a vector cush.type that indicates the syndrome type of each observation in the reconstructed dataset cush. The log transform of the continuous variables ensure that their distributions are more symmetrical. boxplot(cush[,1:2]) cush[,1] < log(cush[,1]) cush[,2] < log(cush[,2]) boxplot(cush[,1:2]) 2. First, we perform a linear discriminant analysis on the Cushing s data. The function lda accepts the continuous variables as its first argument and the class labels as its second argument. cush.lda<lda(cush[,1:2], cush.type) cush.lda The command cush.lda is used to display some of the discriminant analysis results. Note that the linear discriminants, a 1 and a 2, are displayed.
2 We use the method predict to project observations onto the linear discriminants. The argument dimen specifies the number of discriminant components onto which the data will be projected. cush.pred<predict(cush.lda, dimen=2) After the object cush.pred is created we need to extract the scores (projected observations) on the first dimen discriminant variables. cush.pr<cush.pred$x Let us look at the data projected onto the first two discriminant components. eqscplot(cush.pr, type="n", xlab="first linear discriminant", ylab="second linear discriminant", main="lda") text(cush.pr, labels=as.character(cush.type), cex=0.8) 3. We can create a plot that indicates the linear discriminant analysis decision boundaries of the dataset cush[,1:2]. The function dec.bound.plot, in the workspace PracticalObjects, is used to create such a plot. This function only accepts datasets with two continuous variables and one categorical (or nominal) variable that indicate the group or class of each observation. The first argument of this function is the object created by using lda (or qda), and the second and third arguments are the data (continuous variables) and the class labels, respectively. dec.bound.plot(cush.lda, cush[,1:2], cush.type, "Tetrahydrocortisone", "Pregnanetriol", "LDA") Examining the plot, it is clear that the fourth, fifth, and sixth arguments of dec.bound.plot are the x axis label, y axis label, and the title of the plot, respectively. 4. The function qda can be used to perform a quadratic discriminant analysis and dec.bound.plot enables us to visualise the quadratic decision boundaries. cush.qda < qda(cush[,1:2], cush.type) dec.bound.plot(cush.qda, cush[,1:2], cush.type, "Tetrahydrocortisone", "Pregnanetriol", "LDA") 5. The function visu.lqda, which is also in the workspace PracticalObjects, can be used to display the estimated Gaussian distributions for each class in addition to the linear or quadratic decision boundaries. For LDA the orientation and spread of the estimated densities for each class are similar. visu.lqda(cush.lda, cush) Compare this to the spread and orientation of the estimated Gaussian distributions for QDA. Are there any differences? visu.lqda(cush.qda, cush) Note that the second argument of the function visu.lqda is the full dataset (continuous variables and grouping factor). 6. The reduced rank LDA does not use all projection directions. For the Cushing s data we can use a maximum of two projection directions. The function visu.rr1.lda can
3 be used to fit a reduced rank LDA with rank 1 to the cush dataset, and to visualise the decision boundaries. visu.rr1.lda(cush.lda, cush) The solid black lines indicate the decision boundaries, while the coloured dashed lines indicate the projection of the data points onto one dimension. Linear and Quadratic Discriminant Analysis: Vanveer Data We consider the breast tumour data again and perform discriminant analyses on different subsets of the genes. For linear discriminant analysis we will use smaller subsets of the genes to ensure that the estimated covariance matrices are of full rank. vanv.10<vanveer.4000[,1:11] vanv.20<vanveer.4000[,1:21] vanv.prog<vanveer.4000[,1] 1. In this section we will look more closely at the proportion of observations that are accurately classified when we us the LDA method. We perform an LDA on the dataset containing the subset of 10 best genes. As above we use the function predict to extract information about the projection of the data onto the linear discriminant components. vanv.lda.10 < lda(vanv.10[,2:10],vanv.prog) vanv.pred.10<predict(vanv.lda.10) Calculate the proportion of patients that are accurately classified. The function table counts the number of patients with prognosis poor that are classified as good or poor, and the number of patients with prognosis good that are classified as good or poor. table(vanv.progn, vanv.pred.10$class) How many patients are misclassified? We can adjust the above commands to see how many patients are misclassified when we use the subset of 20 best genes. vanv.lda.20 < lda(vanv.20[,2:21],vanv.prog) vanv.pred.20<predict(vanv.lda.20) table(vanv.progn, vanv.pred.20$class) 2. Note that we used the same data to fit and assess the classification performance. A more accurate estimate of the misclassification rate can be obtained by dividing the dataset into a training set and a test set. We randomly select patients for the training set (roughly 60% of the patients), and the remaining patients constitute the test set. train<runif(nrow(vanv.10))<0.60 test<!train The vectors train and test are indicator vectors they indicate which observations in the dataset vanv.10 belong to the training set, and which belong to the test set.
4 We perform a linear discriminant analysis on the data in the training set. However, we predict the data in the test set on the linear discriminant components. To calculate the number of misclassifications in the test set we compare the predicted classes of the observations in the test set with their true classes. vanv.lda.10 < lda(vanv.10[train,2:11],vanv.prog[train]) vanv.pred.10<predict(vanv.lda.10, vanv.10[test,2:11]) table(vanv.progn[test], vanv.pred.10$class) How many patients in the test set are misclassified, for the subset of 10 best genes? Decision Trees: Cushing s Syndrome Data In this section we look at the Cushing s syndrome data in the library MASS. 1. First we grow a full decision tree, and examine splits that are carried out. The function rpart in the package rpart is used to grow a tree. First we specify the arguments formula and data. The argument formula requires that we indicate the grouping factor on the left hand side of the and the other variables, which determine the grouping, on the right hand side. The argument data is used to indicate the data frame which contains the variable named in the formula. Note how we specify the formula: Type.. Here Type is the name of the variable that contains the grouping/class label of each observation, and the. indicates all the other variables in the data frame. Two control parameters are specified as well: minsplit and cp. The former indicates the minimum number of observations that must exist in a node in order for a split to be attempted while the complexity parameter cp specifies that any split that does not decrease the overall lack of fit by a factor of cp should not be attempted. (Use the command?rpart.control to look at various parameters that control aspects of the rpart fit.) library(rpart) cush.tree < rpart(type~.,data=cush,cp=0,minsplit=3) summary(cush.tree) plot(cush.tree) text(cush.tree) The function summary can be used to display details of the fitted tree. The function plot.partition in the workspace PracticalObjects is used to visualise the partitioning of the Cushing s data in two dimensions. Note that this function can only be used to visualise datasets with two continuous variables (which explains the grouping). plot.partition(cush.tree, cush) 2. To prune the decision tree we can use crossvalidation to pick a subtree. The function plotcp gives a visual representation of the crossvalidation results of an rpart object. The vertical lines show the standard errors of the respective subtrees while the horizontal
5 line is one standard error worse than the best subtree. A good choice of the size of the subtree is often the leftmost value for which the mean lies below the horizontal line. plotcp(cush.tree) Accordingly, we choose the subtree of size three and with a cp value of 0.15 for the Cushing s dataset. The function prune is used to prune the full decision tree to a subtree of a specific size. We have to specify the value of cp when we use the function prune to ensure that the tree is pruned to the desired extent. cush.prune < prune(cush.tree,cp=0.15) plot(cush.prune) text(cush.prune) Look at the partitioning of the feature space that corresponds to this decision tree. plot.partition(cush.prune, cush) How many data points are misclassified? Decision Trees: Singh Data In this section we use the prostate cancer data of Singh et al. (2002), discussed in Example 2, Section We examine the performance of decision trees on gene expression data. 1. The full decision tree The Singh dataset consists of a training and test set, singh.train and singh.test, respectively. These datasets are stored in the workspace PracticalObjects. Let us examine the Singh data set. First, we can look at the names of the variables in the dataset by using the function dimnames. Second, it is also of interest to find the dimensions of each dataset. For this purpose we use the functions nrow and ncol which gives the number of rows and columns of the dataset, respectively. dimnames(singh.train)[[2]] nrow(singh.train) ncol(singh.train) nrow(singh.test) ncol(singh.test) We note that the first variable in the Singh dataset, outcome, indicates whether it is a tumour or nontumour sample. How are these two classes indicated? levels(singh.train[,1]) A full tree is grown for the training data, and plotted. singh.tree < rpart(outcome~.,data=singh.train,cp=0) plot(singh.tree) text(singh.tree) How many genes are used to classify the data?
6 2. The performance of the tree is evaluated by calculating the training error and test error. To determine the classification of each observation in the training set we use the function predict. singh.prob.pred<predict(singh.tree, singh.train) singh.prob.pred If the posterior probability for the class label normal is larger than 0.5, then an observation is classified as normal, and as tumour otherwise. Let us create a vector that indicates the classification of each observation in the training dataset. The function ifelse can be used to create such a vector. singh.class<ifelse(singh.prob.pred[,1]>0.5,"normal","tumour") singh.class Read more about this function (?ifelse). Now we can compute the training error by comparing the vector singh.train$outcome (which contains the class labels for the training set) with the vector singh.class (containing the predicted classes). The training error is the number of misclassified observations divided by the total number of observations in the training set. singh.train.error<sum(singh.class!=singh.train$outcome)/ nrow(singh.train) Make sure you understand how we calculated the training error by breaking the above command down into smaller parts: misclass.vector< singh.class!=singh.train$outcome number.misclass<sum(misclass.vector) number.obs.train.set< nrow(singh.train) singh.train.error< number.misclass / number.obs.train.set misclass.vector number.misclass singh.train.error The test error can be calculated by predicting the classes of the test set first. singh.prob.pred<predict(singh.tree,singh.test) Exercises 1. LDA and QDA Adjust the commands given in point?? (LDA and QDA: Vanveer Data section) and calculate the proportion of misclassifications in the test set for the subsets of 15 and 20 best genes. Compare these proportions to the proportion of misclassifications calculated at point?? when we used the whole dataset to perform an LDA and to assess its performance. Do you notice any differences? 2. We can use the same training set and test set as defined above (point??) and assess the performance of quadratic discriminant analysis. The quadratic decision boundaries are more flexible than the linear decision boundaries (as seen in the plots constructed for
7 the Cushing s data). This can create the expectation that QDA will lead to a lower rate of misclassification. Adjust the above code to compute the proportion of misclassified patients for the datasets containing the subsets of 10 and 15 best genes, respectively. Compare these proportions to the corresponding proportions of misclassifications obtained by using LDA. Does QDA perform better than LDA, when we compare the error rates? Is the proportion of misclassifications high overall? 3. Trees: Singh Data Return to point?? (Trees: Singh Data section) in the practical assignment. Calculate the test error for the Singh data. Compare this with the training error. Is the test error large relative to the training error?
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationComputational Assignment 4: Discriminant Analysis
Computational Assignment 4: Discriminant Analysis Written by James Wilson Edited by Andrew Nobel In this assignment, we will investigate running Fisher s Discriminant analysis in R. This is a powerful
More informationData Mining with R. Decision Trees and Random Forests. Hugh Murrell
Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
More informationHomework Assignment 7
Homework Assignment 7 36350, Data Mining Solutions 1. Base rates (10 points) (a) What fraction of the emails are actually spam? Answer: 39%. > sum(spam$spam=="spam") [1] 1813 > 1813/nrow(spam) [1] 0.3940448
More informationEnterprise Miner  Decision tree 1
Enterprise Miner  Decision tree 1 ECLT5810 ECommerce Data Mining Technique SAS Enterprise Miner  Decision Tree I. Tree Node Setting Tree Node Defaults  define default options that you commonly use
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationComparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Nonlinear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Nonlinear
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More information!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'*&./#$&'(&(0*".$#$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'*&./#$&'(&(0*".$#$1"(2&."3$'45"!"#"$%&#'()*+',$$.&#',/"0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
More informationClassification and regression trees
Classification and regression trees December 9 Introduction We ve seen that local methods and splines both operate by partitioning the sample space of the regression variable(s), and then fitting separate/piecewise
More informationData Clustering. Dec 2nd, 2013 Kyrylo Bessonov
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms kmeans Hierarchical Main
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More informationImplementation in Enterprise Miner: Decision Tree with Binary Response
Implementation in Enterprise Miner: Decision Tree with Binary Response Outline 8.1 Example 8.2 The Options in Tree Node 8.3 Tree Results 8.4 Example Continued Appendix A: Tree and Missing Values  1 
More informationA short course in Longitudinal Data Analysis ESRC Research Methods and Short Course Material for Practicals with the joiner package.
A short course in Longitudinal Data Analysis ESRC Research Methods and Short Course Material for Practicals with the joiner package. Lab 2  June, 2008 1 jointdata objects To analyse longitudinal data
More informationDidacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
More information2 Decision tree + Crossvalidation with R (package rpart)
1 Subject Using crossvalidation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of crossvalidation for assessing
More informationDidacticiel  Études de cas
1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition.
More informationDoptimal plans in observational studies
Doptimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationAn Overview and Evaluation of Decision Tree Methodology
An Overview and Evaluation of Decision Tree Methodology ASA Quality and Productivity Conference Terri Moore Motorola Austin, TX terri.moore@motorola.com Carole Jesse Cargill, Inc. Wayzata, MN carole_jesse@cargill.com
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationSupervised Feature Selection & Unsupervised Dimensionality Reduction
Supervised Feature Selection & Unsupervised Dimensionality Reduction Feature Subset Selection Supervised: class labels are given Select a subset of the problem features Why? Redundant features much or
More informationJetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationM1 in Economics and Economics and Statistics Applied multivariate Analysis  Big data analytics Worksheet 3  Random Forest
Nathalie VillaVialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis  Big data analytics Worksheet 3  Random Forest This worksheet s aim is to learn how
More informationSTATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Webbased Analytics Table of Contents INTRODUCTION: WHAT
More informationProfessor Anita Wasilewska. Classification Lecture Notes
Professor Anita Wasilewska Classification Lecture Notes Classification (Data Mining Book Chapters 5 and 7) PART ONE: Supervised learning and Classification Data format: training and test data Concept,
More informationBIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
More informationData Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
More informationData mining techniques: decision trees
Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationPackage MDM. February 19, 2015
Type Package Title Multinomial Diversity Model Version 1.3 Date 20130628 Package MDM February 19, 2015 Author Glenn De'ath ; Code for mdm was adapted from multinom in the nnet package
More informationNonnegative Matrix Factorization (NMF) in Semisupervised Learning Reducing Dimension and Maintaining Meaning
Nonnegative Matrix Factorization (NMF) in Semisupervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
More informationVisualization methods for patent data
Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes
More informationBig Data: Rethinking Text Visualization
Big Data: Rethinking Text Visualization Dr. Anton Heijs anton.heijs@treparel.com Treparel April 8, 2013 Abstract In this white paper we discuss text visualization approaches and how these are important
More informationModel Selection. Introduction. Model Selection
Model Selection Introduction This user guide provides information about the Partek Model Selection tool. Topics covered include using a Down syndrome data set to demonstrate the usage of the Partek Model
More informationIdeas and Tools for a Data Mining Style of Analysis
Ideas and Tools for a Data Mining Style of Analysis John Maindonald July 2, 2009 Contents 1 Resampling and Other Computer Intensive Methods 2 2 Key Motivations 4 2.1 Computing Technology is Transforming
More informationClassifiers & Classification
Classifiers & Classification Forsyth & Ponce Computer Vision A Modern Approach chapter 22 Pattern Classification Duda, Hart and Stork School of Computer Science & Statistics Trinity College Dublin Dublin
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationData Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining
Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data
More informationHierarchical Clustering Analysis
Hierarchical Clustering Analysis What is Hierarchical Clustering? Hierarchical clustering is used to group similar objects into clusters. In the beginning, each row and/or column is considered a cluster.
More informationClassification/Decision Trees (II)
Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Right Sized Trees Let the expected misclassification rate of a tree T be R (T ).
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationChapter 3 Introduction to Predictive Modeling: Predictive Modeling Fundamentals and Decision Trees
Chapter 3 Introduction to Predictive Modeling: Predictive Modeling Fundamentals and Decision Trees 3.1 Creating Training and Validation Data... 32 3.2 Constructing a Decision Tree Predictive Model...
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationSTAT 503X Case Study 2: Italian Olive Oils
STAT 503X Case Study 2: Italian Olive Oils 1 Description This data consists of the percentage composition of 8 fatty acids (palmitic, palmitoleic, stearic, oleic, linoleic, linolenic, arachidic, eicosenoic)
More informationLecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
More informationStep 5: Conduct Analysis. The CCA Algorithm
Model Parameterization: Step 5: Conduct Analysis P Dropped species with fewer than 5 occurrences P Logtransformed species abundances P Rownormalized species log abundances (chord distance) P Selected
More informationTHE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE. Alexander Barvinok
THE NUMBER OF GRAPHS AND A RANDOM GRAPH WITH A GIVEN DEGREE SEQUENCE Alexer Barvinok Papers are available at http://www.math.lsa.umich.edu/ barvinok/papers.html This is a joint work with J.A. Hartigan
More informationClassification by Pairwise Coupling
Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS1332014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationGaussian Classifiers CS498
Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary
More informationData Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
More information1 Topic. 2 Scilab. 2.1 What is Scilab?
1 Topic Data Mining with Scilab. I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida  1  Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationIn this presentation, you will be introduced to data mining and the relationship with meaningful use.
In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine
More informationChapter 4: NonParametric Classification
Chapter 4: NonParametric Classification Introduction Density Estimation Parzen Windows KnNearest Neighbor Density Estimation KNearest Neighbor (KNN) Decision Rule Gaussian Mixture Model A weighted combination
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19  Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19  Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.standrews.ac.uk twk@standrews.ac.uk Tom Kelsey ID505919B &
More informationClassification and Regression Trees
Classification and Regression Trees Bob Stine Dept of Statistics, School University of Pennsylvania Trees Familiar metaphor Biology Decision tree Medical diagnosis Org chart Properties Recursive, partitioning
More informationClassification: Basic Concepts, Decision Trees, and Model Evaluation. General Approach for Building Classification Model
10 10 Classification: Basic Concepts, Decision Trees, and Model Evaluation Dr. Hui Xiong Rutgers University Introduction to Data Mining 1//009 1 General Approach for Building Classification Model Tid Attrib1
More information3. Data Analysis, Statistics, and Probability
3. Data Analysis, Statistics, and Probability Data and probability sense provides students with tools to understand information and uncertainty. Students ask questions and gather and use data to answer
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationDecision Trees (Sections )
Decision Trees (Sections 8.18.4) Nonmetric methods CART (Classification & regression Trees) Number of splits Query selection & node impurity Multiway splits When to stop splitting? Pruning Assignment
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationSPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING
AAS 07228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations
More informationLearning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
More informationBig Data Decision Trees with R
REVOLUTION ANALYTICS WHITE PAPER Big Data Decision Trees with R By Richard Calaway, Lee Edlefsen, and Lixin Gong Fast, Scalable, Distributable Decision Trees Revolution Analytics RevoScaleR package provides
More informationData driven design of filter bank for speech recognition
Data driven design of filter bank for speech recognition Lukáš Burget 12 and Hynek Heřmanský 23 1 Oregon Graduate Institute, Anthropic Signal Processing Group, 2 NW Walker Rd., Beaverton, Oregon 9768921,
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationLinear Discriminant Analysis
Fiche TD avec le logiciel : course5 Linear Discriminant Analysis A.B. Dufour Contents 1 Fisher s iris dataset 2 2 The principle 5 2.1 Linking one variable and a factor.................. 5 2.2 Linking a
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationEECS 349 Titanic Machine Learning From Disaster
EECS 349 Titanic Machine Learning From Disaster Xiaodong Yang Northwestern University Abstract In this project, we see how we can use machinelearning techniques to predict survivors of the Titanic. With
More information11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression
Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 letex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial
More informationData Mining with R. December 13, 2008
Data Mining with R John Maindonald (Centre for Mathematics and Its Applications, Australian National University) and Yihui Xie (School of Statistics, Renmin University of China) December 13, 2008 Data
More informationDATA ANALYSIS II. Matrix Algorithms
DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationClassification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationData Mining  Evaluation of Classifiers
Data Mining  Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationSurvey, Statistics and Psychometrics Core Research Facility University of NebraskaLincoln. LogRank Test for More Than Two Groups
Survey, Statistics and Psychometrics Core Research Facility University of NebraskaLincoln LogRank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)
More informationCS 525 Class Project Breast Cancer Diagnosis via Quadratic Programming Fall, 2015 Due 15 December 2015, 5:00pm
CS 525 Class Project Breast Cancer Diagnosis via Quadratic Programming Fall, 2015 Due 15 December 2015, 5:00pm In this project, we apply quadratic programming to breast cancer diagnosis. We use the Wisconsin
More informationTan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2. Tid Refund Marital Status
Data Mining Classification: Basic Concepts, Decision Trees, and Evaluation Lecture tes for Chapter 4 Introduction to Data Mining by Tan, Steinbach, Kumar Classification: Definition Given a collection of
More informationLinear Models for Classification
Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci
More informationData exploration with Microsoft Excel: analysing more than one variable
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
More informationProgramming Exercise 3: Multiclass Classification and Neural Networks
Programming Exercise 3: Multiclass Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement onevsall logistic regression and neural networks
More informationWhat is Data mining?
STAT : DATA MIIG Javier Cabrera Fall Business Question Answer Business Question What is Data mining? Find Data Data Processing Extract Information Data Analysis Internal Databases Data Warehouses Internet
More informationClustering & Visualization
Chapter 5 Clustering & Visualization Clustering in highdimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to highdimensional data.
More informationSOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING
SOLVING DATABASE QUERIES USING ORTHOGONAL RANGE SEARCHING CPS234: Computational Geometry Prof. Pankaj Agarwal Prepared by: Khalid Al Issa (khalid@cs.duke.edu) Fall 2005 BACKGROUND : RANGE SEARCHING, a
More informationIII. GRAPHICAL METHODS
Pie Charts and Bar Charts: III. GRAPHICAL METHODS Pie charts and bar charts are used for depicting frequencies or relative frequencies. We compare examples of each using the same data. Sources: AT&T (1961)
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationIBM SPSS Decision Trees 21
IBM SPSS Decision Trees 21 Note: Before using this information and the product it supports, read the general information under Notices on p. 104. This edition applies to IBM SPSS Statistics 21 and to all
More informationGetting Started with R and RStudio 1
Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationData Mining Lab 5: Introduction to Neural Networks
Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese
More informationFeature Extraction and Selection. More Info == Better Performance? Curse of Dimensionality. Feature Space. HighDimensional Spaces
More Info == Better Performance? Feature Extraction and Selection APR Course, Delft, The Netherlands Marco Loog Feature Space Curse of Dimensionality A pdimensional space, in which each dimension is a
More informationPCA, Clustering and Classification. By H. Bjørn Nielsen strongly inspired by Agnieszka S. Juncker
PCA, Clustering and Classification By H. Bjørn Nielsen strongly inspired by Agnieszka S. Juncker Motivation: Multidimensional data Pat1 Pat2 Pat3 Pat4 Pat5 Pat6 Pat7 Pat8 Pat9 209619_at 7758 4705 5342
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More information