Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
- Tyrone Marshall
In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques. Load the workspace containing the R objects and functions for this assignment, PracticalObjects.RData.

Linear and Quadratic Discriminant Analysis: Cushing's Syndrome Data

In this section we consider the Cushing's syndrome data, which are available in the package MASS, and use the functions lda and qda to perform linear and quadratic discriminant analysis, respectively. These two functions are also available in the package MASS.

    library(MASS)
    Cushings

1. Three types of Cushing's syndrome are distinguished, coded a, b, and c. We remove the observations with type u (unknown) from the dataset.

    cush <- Cushings[Cushings$Type != "u", ]
    cush[, 3] <- factor(cush[, 3], levels = c("a", "b", "c"))
    cush.type <- cush[, 3]

The function factor is used to drop the unused level u, and the vector cush.type indicates the syndrome type of each observation in the reduced dataset cush. A log transform of the continuous variables makes their distributions more symmetrical; compare the boxplots before and after the transformation.

    boxplot(cush[, 1:2])
    cush[, 1] <- log(cush[, 1])
    cush[, 2] <- log(cush[, 2])
    boxplot(cush[, 1:2])

2. First, we perform a linear discriminant analysis on the Cushing's data. The function lda accepts the continuous variables as its first argument and the class labels as its second argument.

    cush.lda <- lda(cush[, 1:2], cush.type)
    cush.lda

The command cush.lda displays some of the discriminant analysis results. Note that the coefficients of the linear discriminants, a_1 and a_2, are displayed.
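The printed summary of an lda fit can also be taken apart component by component. The following is a minimal sketch, assuming only the MASS package; it rebuilds the cleaned data so that it runs on its own, and the name prop.trace is introduced here purely for illustration.

```r
library(MASS)

# Rebuild the cleaned, log-transformed data used in the text.
cush <- Cushings[Cushings$Type != "u", ]
cush$Type <- factor(cush$Type, levels = c("a", "b", "c"))
cush[, 1:2] <- log(cush[, 1:2])

cush.lda <- lda(cush[, 1:2], cush$Type)

cush.lda$prior    # estimated class proportions
cush.lda$means    # class means on the log scale
cush.lda$scaling  # columns LD1 and LD2: coefficients of the linear discriminants

# Share of the between-class variation captured by each discriminant
# (reported as "Proportion of trace" when cush.lda is printed).
prop.trace <- cush.lda$svd^2 / sum(cush.lda$svd^2)
prop.trace
```

The columns of cush.lda$scaling are the vectors a_1 and a_2 referred to above.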
We use the method predict to project observations onto the linear discriminants. The argument dimen specifies the number of discriminant components onto which the data will be projected.

    cush.pred <- predict(cush.lda, dimen = 2)

After the object cush.pred is created we need to extract the scores (projected observations) on the first dimen discriminant variables.

    cush.pr <- cush.pred$x

Let us look at the data projected onto the first two discriminant components.

    eqscplot(cush.pr, type = "n", xlab = "first linear discriminant",
             ylab = "second linear discriminant", main = "lda")
    text(cush.pr, labels = as.character(cush.type), cex = 0.8)

3. We can create a plot that indicates the linear discriminant analysis decision boundaries for the dataset cush[,1:2]. The function dec.bound.plot, in the workspace PracticalObjects, is used to create such a plot. This function only accepts datasets with two continuous variables and one categorical (or nominal) variable that indicates the group or class of each observation. The first argument of this function is the object created by lda (or qda), and the second and third arguments are the data (continuous variables) and the class labels, respectively.

    dec.bound.plot(cush.lda, cush[, 1:2], cush.type,
                   "Tetrahydrocortisone", "Pregnanetriol", "LDA")

Examining the plot, it is clear that the fourth, fifth, and sixth arguments of dec.bound.plot are the x axis label, the y axis label, and the title of the plot, respectively.

4. The function qda can be used to perform a quadratic discriminant analysis, and dec.bound.plot enables us to visualise the quadratic decision boundaries.

    cush.qda <- qda(cush[, 1:2], cush.type)
    dec.bound.plot(cush.qda, cush[, 1:2], cush.type,
                   "Tetrahydrocortisone", "Pregnanetriol", "QDA")

5. The function visu.lqda, which is also in the workspace PracticalObjects, can be used to display the estimated Gaussian distributions for each class in addition to the linear or quadratic decision boundaries.
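The decision regions of points 3 and 4 can be approximated without the course workspace by classifying a fine grid of points. The sketch below assumes only MASS and base graphics; region.plot is a hypothetical helper written for this illustration and is not the course's dec.bound.plot.

```r
library(MASS)

cush <- Cushings[Cushings$Type != "u", ]
cush$Type <- factor(cush$Type, levels = c("a", "b", "c"))
cush[, 1:2] <- log(cush[, 1:2])

# Hypothetical helper: colour a grid of points by predicted class so that
# the decision regions of an lda or qda fit become visible.
region.plot <- function(fit, X, y, n = 100, main = "") {
  xs <- seq(min(X[, 1]), max(X[, 1]), length.out = n)
  ys <- seq(min(X[, 2]), max(X[, 2]), length.out = n)
  grid <- expand.grid(xs, ys)
  names(grid) <- names(X)                  # match the fitted variable names
  grid.class <- predict(fit, grid)$class   # predicted class at each grid point
  plot(grid[, 1], grid[, 2], col = as.integer(grid.class) + 1, pch = ".",
       xlab = names(X)[1], ylab = names(X)[2], main = main)
  text(X[, 1], X[, 2], labels = as.character(y), cex = 0.8)
  invisible(grid.class)
}

cush.lda <- lda(cush[, 1:2], cush$Type)
cush.qda <- qda(cush[, 1:2], cush$Type)
regions.lda <- region.plot(cush.lda, cush[, 1:2], cush$Type, main = "LDA")
regions.qda <- region.plot(cush.qda, cush[, 1:2], cush$Type, main = "QDA")
```

The borders between colours are straight lines for the lda fit and curves for the qda fit; a finer grid (larger n) gives sharper borders at the cost of more predictions.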
For LDA the orientation and spread of the estimated densities for each class are similar.

    visu.lqda(cush.lda, cush)

Compare this to the spread and orientation of the estimated Gaussian distributions for QDA. Are there any differences?

    visu.lqda(cush.qda, cush)

Note that the second argument of the function visu.lqda is the full dataset (continuous variables and grouping factor).

6. Reduced-rank LDA does not use all projection directions. For the Cushing's data we can use a maximum of two projection directions. The function visu.rr1.lda can be used to fit a reduced-rank LDA with rank 1 to the cush dataset, and to visualise the decision boundaries.

    visu.rr1.lda(cush.lda, cush)

The solid black lines indicate the decision boundaries, while the coloured dashed lines indicate the projection of the data points onto one dimension.

Linear and Quadratic Discriminant Analysis: Vanveer Data

We consider the breast tumour data again and perform discriminant analyses on different subsets of the genes. For linear discriminant analysis we will use smaller subsets of the genes to ensure that the estimated covariance matrices are of full rank.

    vanv.10 <- vanveer.4000[, 1:11]
    vanv.20 <- vanveer.4000[, 1:21]
    vanv.prog <- vanveer.4000[, 1]

1. In this section we will look more closely at the proportion of observations that are accurately classified when we use the LDA method. We perform an LDA on the dataset containing the subset of 10 best genes (columns 2 to 11 of vanv.10; the first column holds the prognosis). As above we use the function predict to extract information about the projection of the data onto the linear discriminant components.

    vanv.lda.10 <- lda(vanv.10[, 2:11], vanv.prog)
    vanv.pred.10 <- predict(vanv.lda.10)

Calculate the proportion of patients that are accurately classified. The function table counts the number of patients with prognosis poor that are classified as good or poor, and the number of patients with prognosis good that are classified as good or poor.

    table(vanv.prog, vanv.pred.10$class)

How many patients are misclassified? We can adjust the above commands to see how many patients are misclassified when we use the subset of 20 best genes.

    vanv.lda.20 <- lda(vanv.20[, 2:21], vanv.prog)
    vanv.pred.20 <- predict(vanv.lda.20)
    table(vanv.prog, vanv.pred.20$class)

2. Note that we used the same data to fit the model and to assess its classification performance. A more accurate estimate of the misclassification rate can be obtained by dividing the dataset into a training set and a test set.
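Turning a confusion table into a misclassification proportion takes one line. Because vanveer.4000 exists only in the course workspace, the sketch below uses the Cushing's data as a stand-in; the helper misclass is introduced here for illustration. It also assumes that classifying with predict(..., dimen = 1) is a reasonable counterpart to the rank 1 fit of point 6, which the source does not confirm.

```r
library(MASS)

cush <- Cushings[Cushings$Type != "u", ]
cush$Type <- factor(cush$Type, levels = c("a", "b", "c"))
cush[, 1:2] <- log(cush[, 1:2])
cush.lda <- lda(cush[, 1:2], cush$Type)

# Resubstitution: predict the same observations that were used for fitting.
pred.full  <- predict(cush.lda, cush[, 1:2])             # both discriminants
pred.rank1 <- predict(cush.lda, cush[, 1:2], dimen = 1)  # first discriminant only

conf.full <- table(true = cush$Type, predicted = pred.full$class)
conf.full

# Off-diagonal counts are the misclassified observations.
misclass <- function(tab) 1 - sum(diag(tab)) / sum(tab)
err.full  <- misclass(conf.full)
err.rank1 <- misclass(table(cush$Type, pred.rank1$class))
c(full = err.full, rank1 = err.rank1)
```

Both figures are resubstitution errors, so they are optimistic in the same way as the table computed on the full Vanveer data.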
We randomly select patients for the training set (roughly 60% of the patients), and the remaining patients constitute the test set.

    train <- runif(nrow(vanv.10)) < 0.60
    test <- !train

The vectors train and test are indicator vectors: they indicate which observations in the dataset vanv.10 belong to the training set, and which belong to the test set.
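Because the split is random, results change from run to run unless a seed is set. The following sketch illustrates this with a hypothetical sample size n (in the practical this would be nrow(vanv.10)); the seed value is arbitrary, and the fixed-size alternative using sample is an addition not found in the assignment.

```r
set.seed(42)   # arbitrary seed, only so that the split can be reproduced
n <- 100       # hypothetical number of patients

# Bernoulli split: each patient joins the training set with probability 0.6,
# so the size of the training set is itself random.
train <- runif(n) < 0.60
test  <- !train
c(train = sum(train), test = sum(test))

# Alternative: fix the training-set size at exactly 60% of the patients.
train.idx   <- sample(n, size = round(0.6 * n))
train.fixed <- seq_len(n) %in% train.idx
sum(train.fixed)
```

Either indicator vector can be used for row subsetting in the same way as train and test.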
We perform a linear discriminant analysis on the data in the training set, and then use the fitted model to classify the observations in the test set. To calculate the number of misclassifications in the test set we compare the predicted classes of the observations in the test set with their true classes.

    vanv.lda.10 <- lda(vanv.10[train, 2:11], vanv.prog[train])
    vanv.pred.10 <- predict(vanv.lda.10, vanv.10[test, 2:11])
    table(vanv.prog[test], vanv.pred.10$class)

How many patients in the test set are misclassified, for the subset of 10 best genes?

Decision Trees: Cushing's Syndrome Data

In this section we look at the Cushing's syndrome data in the library MASS.

1. First we grow a full decision tree, and examine the splits that are carried out. The function rpart in the package rpart is used to grow a tree. First we specify the arguments formula and data. The argument formula requires that we indicate the grouping factor on the left hand side of the ~ and the other variables, which determine the grouping, on the right hand side. The argument data is used to indicate the data frame which contains the variables named in the formula. Note how we specify the formula: Type ~ . Here Type is the name of the variable that contains the grouping/class label of each observation, and the . indicates all the other variables in the data frame. Two control parameters are specified as well: minsplit and cp. The former indicates the minimum number of observations that must exist in a node in order for a split to be attempted, while the complexity parameter cp specifies that any split that does not decrease the overall lack of fit by a factor of cp should not be attempted. (Use the command ?rpart.control to look at the various parameters that control aspects of the rpart fit.)

    library(rpart)
    cush.tree <- rpart(Type ~ ., data = cush, cp = 0, minsplit = 3)
    summary(cush.tree)
    plot(cush.tree)
    text(cush.tree)

The function summary can be used to display details of the fitted tree.
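Beyond summary, the fitted rpart object can be queried directly. The sketch below assumes only the packages MASS and rpart and rebuilds the data so that it runs on its own; the names split.vars, tree.class and train.error are introduced here for illustration.

```r
library(MASS)
library(rpart)

cush <- Cushings[Cushings$Type != "u", ]
cush$Type <- factor(cush$Type, levels = c("a", "b", "c"))
cush[, 1:2] <- log(cush[, 1:2])

cush.tree <- rpart(Type ~ ., data = cush, cp = 0, minsplit = 3)

# Variables actually used for splitting (terminal nodes are marked "<leaf>").
split.vars <- setdiff(unique(as.character(cush.tree$frame$var)), "<leaf>")
split.vars

# Predicted class of each training observation and the confusion table.
tree.class <- predict(cush.tree, cush, type = "class")
table(true = cush$Type, predicted = tree.class)
train.error <- mean(tree.class != cush$Type)
train.error
```

A fully grown tree typically has a very small training error, which is one reason pruning is considered next.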
The function plot.partition in the workspace PracticalObjects is used to visualise the partitioning of the Cushing's data in two dimensions. Note that this function can only be used to visualise datasets with two continuous variables (which explain the grouping).

    plot.partition(cush.tree, cush)

2. To prune the decision tree we can use cross-validation to pick a subtree. The function plotcp gives a visual representation of the cross-validation results of an rpart object. The vertical lines show the standard errors of the respective subtrees, while the horizontal line lies one standard error above the cross-validated error of the best subtree. A good choice of the size of the subtree is often the leftmost value for which the mean lies below the horizontal line.

    plotcp(cush.tree)

Accordingly, we choose the subtree of size three, with a cp value of 0.15, for the Cushing's dataset. The function prune is used to prune the full decision tree to a subtree of a specific size. We have to specify the value of cp when we use the function prune to ensure that the tree is pruned to the desired extent.

    cush.prune <- prune(cush.tree, cp = 0.15)
    plot(cush.prune)
    text(cush.prune)

Look at the partitioning of the feature space that corresponds to this decision tree.

    plot.partition(cush.prune, cush)

How many data points are misclassified?

Decision Trees: Singh Data

In this section we use the prostate cancer data of Singh et al. (2002), discussed in Example 2, Section ??. We examine the performance of decision trees on gene expression data.

1. The full decision tree. The Singh dataset consists of a training set and a test set, singh.train and singh.test, respectively. These datasets are stored in the workspace PracticalObjects. Let us examine the Singh dataset. First, we can look at the names of the variables in the dataset by using the function dimnames. Second, it is also of interest to find the dimensions of each dataset. For this purpose we use the functions nrow and ncol, which give the number of rows and columns of the dataset, respectively.

    dimnames(singh.train)[[2]]
    nrow(singh.train)
    ncol(singh.train)
    nrow(singh.test)
    ncol(singh.test)

We note that the first variable in the Singh dataset, outcome, indicates whether it is a tumour or non-tumour sample. How are these two classes indicated?

    levels(singh.train[, 1])

A full tree is grown for the training data, and plotted.

    singh.tree <- rpart(outcome ~ ., data = singh.train, cp = 0)
    plot(singh.tree)
    text(singh.tree)

How many genes are used to classify the data?
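The one-standard-error rule read off the plotcp display for the Cushing's tree can also be applied programmatically to the cptable stored in the rpart object. The sketch below is one possible reading of that rule, assuming only MASS and rpart; the seed is arbitrary, and since cross-validation is random the chosen cp need not equal the value 0.15 used in the text.

```r
library(MASS)
library(rpart)

cush <- Cushings[Cushings$Type != "u", ]
cush$Type <- factor(cush$Type, levels = c("a", "b", "c"))
cush[, 1:2] <- log(cush[, 1:2])

set.seed(1)  # arbitrary seed; the cross-validation folds are drawn at random
cush.tree <- rpart(Type ~ ., data = cush, cp = 0, minsplit = 3)

cp.tab <- cush.tree$cptable   # columns: CP, nsplit, rel error, xerror, xstd
cp.tab

# Subtree with the smallest cross-validated error, and the 1-SE threshold.
best      <- which.min(cp.tab[, "xerror"])
threshold <- cp.tab[best, "xerror"] + cp.tab[best, "xstd"]

# Smallest subtree (first row) whose cross-validated error is within 1 SE.
chosen <- min(which(cp.tab[, "xerror"] <= threshold))
cush.prune <- prune(cush.tree, cp = cp.tab[chosen, "CP"])
cush.prune
```

Reading the value from plotcp by eye and passing it to prune by hand, as in the text, should lead to a comparable subtree.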
2. The performance of the tree is evaluated by calculating the training error and the test error. To determine the classification of each observation in the training set we use the function predict.

    singh.prob.pred <- predict(singh.tree, singh.train)
    singh.prob.pred

If the posterior probability for the class label normal is larger than 0.5, then an observation is classified as normal, and as tumour otherwise. Let us create a vector that indicates the classification of each observation in the training dataset. The function ifelse can be used to create such a vector.

    singh.class <- ifelse(singh.prob.pred[, 1] > 0.5, "normal", "tumour")
    singh.class

Read more about this function (?ifelse). Now we can compute the training error by comparing the vector singh.train$outcome (which contains the class labels for the training set) with the vector singh.class (containing the predicted classes). The training error is the number of misclassified observations divided by the total number of observations in the training set.

    singh.train.error <- sum(singh.class != singh.train$outcome) / nrow(singh.train)

Make sure you understand how we calculated the training error by breaking the above command down into smaller parts:

    misclass.vector <- singh.class != singh.train$outcome
    number.misclass <- sum(misclass.vector)
    number.obs.train.set <- nrow(singh.train)
    singh.train.error <- number.misclass / number.obs.train.set
    misclass.vector
    number.misclass
    singh.train.error

The test error can be calculated by predicting the classes of the test set first.

    singh.prob.pred <- predict(singh.tree, singh.test)

Exercises

1. LDA and QDA. Adjust the commands given in point ?? (LDA and QDA: Vanveer Data section) and calculate the proportion of misclassifications in the test set for the subsets of 15 and 20 best genes. Compare these proportions to the proportion of misclassifications calculated at point ?? when we used the whole dataset to perform an LDA and to assess its performance. Do you notice any differences?

2. We can use the same training set and test set as defined above (point ??) and assess the performance of quadratic discriminant analysis. The quadratic decision boundaries are more flexible than the linear decision boundaries (as seen in the plots constructed for the Cushing's data). This can create the expectation that QDA will lead to a lower rate of misclassification. Adjust the above code to compute the proportion of misclassified patients for the datasets containing the subsets of 10 and 15 best genes, respectively. Compare these proportions to the corresponding proportions of misclassifications obtained by using LDA. Does QDA perform better than LDA when we compare the error rates? Is the proportion of misclassifications high overall?

3. Trees: Singh Data. Return to point ?? (Trees: Singh Data section) in the practical assignment. Calculate the test error for the Singh data. Compare this with the training error. Is the test error large relative to the training error?
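For exercises 1 and 2, the hold-out comparison of LDA and QDA might be organised around a small helper function. The sketch below uses the built-in iris data purely as a stand-in, since vanveer.4000 is available only in the course workspace; holdout.error is a hypothetical helper and the seed is arbitrary.

```r
library(MASS)

set.seed(1)            # arbitrary seed for a reproducible split
X <- iris[, 1:4]       # stand-in predictors
y <- iris$Species      # stand-in class labels
train <- runif(nrow(X)) < 0.60
test  <- !train

# Fit on the training rows, then measure disagreement on the test rows.
# fit.fun is expected to be lda or qda (or any function with the same interface).
holdout.error <- function(fit.fun) {
  fit  <- fit.fun(X[train, ], y[train])
  pred <- predict(fit, X[test, ])$class
  mean(pred != y[test])
}

errs <- c(lda = holdout.error(lda), qda = holdout.error(qda))
errs
```

Using the same train and test vectors for both methods keeps the comparison fair, which is what exercise 2 asks for.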
More informationGetting Started with R and RStudio 1
Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationLinear Discriminant Analysis
Fiche TD avec le logiciel : course5 Linear Discriminant Analysis A.B. Dufour Contents 1 Fisher s iris dataset 2 2 The principle 5 2.1 Linking one variable and a factor.................. 5 2.2 Linking a
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationClassification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationData Analysis of Trends in iphone 5 Sales on ebay
Data Analysis of Trends in iphone 5 Sales on ebay By Wenyu Zhang Mentor: Professor David Aldous Contents Pg 1. Introduction 3 2. Data and Analysis 4 2.1 Description of Data 4 2.2 Retrieval of Data 5 2.3
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationLearning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
More informationMachine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
More informationTOWARD BIG DATA ANALYSIS WORKSHOP
TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)
More informationIBM SPSS Decision Trees 21
IBM SPSS Decision Trees 21 Note: Before using this information and the product it supports, read the general information under Notices on p. 104. This edition applies to IBM SPSS Statistics 21 and to all
More informationClassification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
More informationInsurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More informationExperiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
More informationSurvey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups
Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)
More informationEXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
More informationCollege Tuition: Data mining and analysis
CS105 College Tuition: Data mining and analysis By Jeanette Chu & Khiem Tran 4/28/2010 Introduction College tuition issues are steadily increasing every year. According to the college pricing trends report
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationImplementation of Breiman s Random Forest Machine Learning Algorithm
Implementation of Breiman s Random Forest Machine Learning Algorithm Frederick Livingston Abstract This research provides tools for exploring Breiman s Random Forest algorithm. This paper will focus on
More informationBusiness Analytics and Credit Scoring
Study Unit 5 Business Analytics and Credit Scoring ANL 309 Business Analytics Applications Introduction Process of credit scoring The role of business analytics in credit scoring Methods of logistic regression
More information