Computational Assignment 4: Discriminant Analysis
Written by James Wilson
Edited by Andrew Nobel

In this assignment, we will investigate running Fisher's discriminant analysis in R. This is a powerful classification tool that is easy to use with basic R commands.

1. Coding your own functions in R: An important aspect of R is its flexibility in allowing the user to write functions. Until now, we have used built-in functions that are available in R; however, at some point all of these functions were manually written (perhaps by some poor student on a homework assignment). Before we begin, here are a few tips to consider when writing a function:

- Leave comments and documentation throughout your code. If you look at this function a year from now, it would be good to remember a) when it was written, b) what the function does, c) what the input and output variables are, and d) what the code is doing in each significant step. When documenting your code, try to make it readable for someone who has never used R.
- Make the function as easy to use as possible. The hope is that any code you write is one day publicly available and easy to use. If it's not easy to use, no one will use it.
- As you are writing the code, try parts of the function bit by bit. This way, you can catch an error before the code gets too long.

Below, I provide code for Fisher's Linear Discriminant Analysis on a Gaussian dataset as an illustrative example. We will walk you through how to extend this code to the case of quadratic discriminant analysis. All of these calculations are directly from the theory given in class. Once we make our own function, we will see how to perform QDA using built-in R commands. Write the following code to a new document (a .r or .txt file) and then copy and paste the function into your R console.

    My.Gaussian.LDA = function(training, test){
      # INPUT:
      #   training = n.1 x (p+1) matrix whose first p columns are observed
      #              variables and whose (p+1)st column contains the class labels.
      #   test     = n.2 x p matrix of observed variables with no class labels.
      # OUTPUT:
      #   norm.vec = direction/normal vector associated with the projection of x
      #   offset   = offset of projection of x onto norm.vec
      #   pred     = prediction of class labels for test set
      #              (0 = first label, 1 = second label)
      # NOTES:
      #   1) We assume that there are only 2 possible class labels.
      #   2) We assume the classes have equal variance and that the data is Gaussian.
      #   3) The linear discriminant of the data is given by the function
      #      g(x) = x^t * norm.vec + offset

      # Extract summary information
      n = dim(training)[1]                      # total number of samples
      p = dim(training)[2] - 1                  # number of variables
      labels = sort(unique(training[,p+1]))     # the unique labels, sorted so the order is fixed
      set.0 = which(training[,p+1] == labels[1])
      set.1 = which(training[,p+1] == labels[2])
      data.set.0 = training[set.0,1:p]          # observed variables for set.0
      data.set.1 = training[set.1,1:p]          # observed variables for set.1

      # Calculate MLEs of the priors and means, and the pooled covariance (divisor n-2)
      pi.0.hat = (1/n) * length(set.0)
      pi.1.hat = (1/n) * length(set.1)
      mu.0.hat = colSums(data.set.0)/length(set.0)
      mu.1.hat = colSums(data.set.1)/length(set.1)
      sigma.hat = ((t(data.set.0) - mu.0.hat)%*%t(t(data.set.0) - mu.0.hat) +
                   (t(data.set.1) - mu.1.hat)%*%t(t(data.set.1) - mu.1.hat)) / (n-2)

      # Coefficients of linear discriminant
      b = log(pi.1.hat/pi.0.hat) -
          .5*(mu.0.hat + mu.1.hat)%*%solve(sigma.hat)%*%(mu.1.hat - mu.0.hat)
      a = solve(sigma.hat)%*%(mu.1.hat - mu.0.hat)

      # Prediction of test set
      Pred = rep(0,dim(test)[1])
      g = as.matrix(test)%*%a + as.numeric(b)
      Pred[which(g>0)] = 1

      # Return the coefficients and prediction
      return(list(norm.vec = a, offset = b, pred = Pred))
    }

Notes:

- To begin a function, we use name = function(). In our example, we use the name My.Gaussian.LDA.
- INPUT arguments are placed within the parentheses of function( ). In our example, our function requires the INPUT arguments training and test.
- The main body of the function is placed between the braces {} at the beginning and end of the function.
- The return( ) command specifies the OUTPUT. The OUTPUTs are placed in the parentheses of return( ). In our example, we return a list with three variables: norm.vec, offset, and pred. Without a return( ) command, R returns only the value of the last expression evaluated in the function, so an explicit return( ) is the clearest way to control what the function gives back.

Be sure to read through the code and make sure that you understand each of the lines, as we have come across all of the commands used in the code. In the next problem we will make use of this function.

Questions:

(a) By using the above function, we assume that the covariance matrices of the variables of both classes of data are the same. Suppose that this was not the case. Write two lines of code that can be used in the above function to calculate the covariance matrices of each class, sigma.0 and sigma.1.
(b) Given sigma.0 and sigma.1 from (a), write out 2 lines of code that would calculate the normal vector and offset in the scenario that the sample covariances are not equal. Note that these results will correspond to quadratic discriminant analysis on Gaussian data.
(c) How might you adjust the above function's INPUT arguments to handle the case of unequal covariances? Hint: think about using a logical value for an argument named equal.covariance.matrices.
(d) (OPTIONAL) Note that once (c) has been completed, one can use the if() statement to handle cases of equal or unequal covariance matrices. To do this, one can write if(equal.covariance.matrices == FALSE){} and if(equal.covariance.matrices == TRUE){} to handle the cases separately. Give this a try if you'd like. Once done, you will have written flexible code for linear or quadratic discriminant analysis on Gaussian data.

2. Visualizing Linear Discriminants: Let's now see how our LDA function performs.
Load the training and test data from the course website using the following commands:

    training = read.table(" = TRUE)
    test = read.table(" header = TRUE)

The training data are simulated in the following manner:

- Labels are first chosen at random with probability P(Label = 1) =
- The first of two variables (X_1) is generated conditionally on the label:
      X_1 | Label = 1 ~ N(0, 1)
      X_1 | Label = 0 ~ N(3, 1)
- The second of the two variables (X_2) is generated as: X_2 = X_1 + N(0, 1).

The test data are also generated as Normal random variables with mean either 1 or 4 and standard deviation 1. Plot the training data to look for any noticeable structure by using the following commands:

    # plot the two variables of the training data
    plot(training$x.1, training$x.2, col = as.factor(training$labels),
         xlab = "x1", ylab = "x2", main = "Training Data")
    # add a legend to the plot
    legend("bottomright", c("0","1"), col = c("black","red"), pch = c(1,1))
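If the course files are not at hand, a stand-in training set can be simulated from the description above. The following is only a sketch under stated assumptions: the source does not give a recoverable value for P(Label = 1), so 0.5 is assumed here, and the sample size, the seed, and the name sim.training are hypothetical. The column names follow those used in the plot( ) command above.

```r
# Sketch: simulate a stand-in training set following the description above.
set.seed(1)          # assumed seed, used only so the draw is repeatable
n <- 200             # assumed sample size (not given in the assignment)
p.label.1 <- 0.5     # ASSUMPTION: P(Label = 1) is not recoverable from the text

# Labels are drawn first, then X_1 given the label, then X_2 = X_1 + N(0, 1)
labels <- rbinom(n, size = 1, prob = p.label.1)
x.1 <- rnorm(n, mean = ifelse(labels == 1, 0, 3), sd = 1)
x.2 <- x.1 + rnorm(n, mean = 0, sd = 1)

# Same layout as the course data: two variables, then the label column
sim.training <- data.frame(x.1 = x.1, x.2 = x.2, labels = labels)

# Per-class sample means, which should sit near 0 (label 1) and 3 (label 0)
class.means <- aggregate(cbind(x.1, x.2) ~ labels, data = sim.training, FUN = mean)
print(class.means)
```

Passing sim.training to the plot( ) and legend( ) commands above in place of training should give a comparable picture.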
Keep this plot up so that we can add a curve momentarily. Now, calculate the linear discriminant of this data using your function My.Gaussian.LDA from (1). Do this using the following code:

    Results = My.Gaussian.LDA(training,test)

Type Results in your console to review the output of your function. Now, let's add the linear discriminant to your plot (which should still be visible). The discriminant is the set of points where g(x) = 0, so solving norm.vec[1]*x1 + norm.vec[2]*x2 + offset = 0 for x2 gives the intercept and slope of the line. Add the line using the abline( ) command as in the following:

    abline(a = -Results$offset/Results$norm.vec[2],
           b = -Results$norm.vec[1]/Results$norm.vec[2], col = "green")

Now, let's view the test data set and see how these points will be classified according to our discriminant rule. Make these plots using the following code:

    plot(test$x.1, test$x.2, xlab = "x1", ylab = "x2", main = "Test Data")
    # Add the discriminant line
    abline(a = -Results$offset/Results$norm.vec[2],
           b = -Results$norm.vec[1]/Results$norm.vec[2], col = "green")

Questions:

(a) Comment on the discriminant and the training data. Are there any misclassifications on this data?
(b) Comment on the discriminant and the test data. How many test points are classified as 1? How many are classified as 0? Discuss any potential uncertainty based on your plot.

3. Quadratic Discriminant Analysis: We can perform quadratic discriminant analysis in R by simply using the qda( ) command. It should be noted that we can also use the lda( ) command to run linear discriminant analysis. Both commands are provided by the MASS package, which must be loaded first. Run quadratic discriminant analysis on this data using the following code:

    library(MASS)   # provides qda( ) and lda( )
    # Split the training data into variables and labels
    training.variables = training[,1:2]
    training.labels = training[,3]
    # Run QDA
    quad.disc = qda(training.variables, grouping = training.labels)

Now, the quad.disc variable contains summary information of the training data and can be used to predict the class of new data by using the predict( ) command.
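To make the fit-then-predict pattern concrete, here is a minimal sketch on made-up data rather than the course data. The names toy.vars, toy.labels, toy.qda, toy.new and toy.out, the seed, and the class means and spreads are all assumptions chosen for illustration. The sketch shows that predict( ) on a qda fit returns a list whose class component holds the predicted labels and whose posterior component holds the estimated class probabilities.

```r
library(MASS)   # qda() is provided by the MASS package

set.seed(2)     # assumed seed, used only so the draw is repeatable

# Hypothetical two-class data with unequal spreads (not the course data)
n <- 100
toy.labels <- rep(c(0, 1), each = n)
toy.vars <- data.frame(
  x.1 = c(rnorm(n, mean = 3, sd = 1), rnorm(n, mean = 0, sd = 2)),
  x.2 = c(rnorm(n, mean = 3, sd = 1), rnorm(n, mean = 0, sd = 2))
)

# Fit QDA: one mean vector and one covariance matrix per class
toy.qda <- qda(toy.vars, grouping = toy.labels)

# New points to classify; the columns must match the training variables
toy.new <- data.frame(x.1 = c(3, 0), x.2 = c(3, 0))
toy.out <- predict(toy.qda, toy.new)

names(toy.out)       # components of the prediction: class and posterior
toy.out$class        # predicted labels, returned as a factor
toy.out$posterior    # one column of posterior probabilities per class
```

Because the class component is a factor, it is usually converted with as.character( ) or as.numeric(as.character( )) before being compared with numeric labels.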
Predict the values of the test data using the following commands:

    # Predict test set
    Prediction.test = predict(quad.disc, test)$class
    # Predict training set
    Prediction.training = predict(quad.disc,training.variables)$class

Questions:

(a) Are there any misclassifications on the predictions for the training variables?
(b) Are there any differences on the predictions for the test variables between the quadratic discriminant rule here and the linear discriminant rule in Question (2)? Based on the plot in Question (2), which point do you think was classified differently?
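One way to count misclassifications, sketched here on made-up vectors rather than the course data, is to cross-tabulate predictions against true labels with table( ) and to count the positions where the two disagree. The names true.labels, predicted, confusion, n.wrong and error.rate are hypothetical.

```r
# Hypothetical true labels and predictions (not taken from the course data)
true.labels <- c(0, 0, 1, 1, 1, 0, 1, 0)
predicted   <- factor(c(0, 1, 1, 1, 0, 0, 1, 0))

# predict()$class is a factor, so compare both vectors as character strings
predicted.chr <- as.character(predicted)
true.chr      <- as.character(true.labels)

# Rows are predictions, columns are true labels; off-diagonal cells are errors
confusion <- table(predicted = predicted.chr, actual = true.chr)

n.wrong    <- sum(predicted.chr != true.chr)    # number of misclassified points
error.rate <- mean(predicted.chr != true.chr)   # fraction of misclassified points

print(confusion)
n.wrong
error.rate
```

The same pattern applies to Prediction.training against training.labels, and to comparing two sets of predictions with each other.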
4. Discriminant Analysis Application: Let's try quadratic discriminant analysis on the iris dataset. We will see how well we can distinguish the setosa and virginica species based on the measured variables. First pre-process the data using the following code:

    # Load the data
    data(iris)
    # Keep only the setosa and virginica species
    iris.sample = iris[which(iris$Species == "setosa" | iris$Species == "virginica"),]
    # Keep a random sample of 50 of these for training.
    rand.sample = sample(1:100,50,replace = FALSE)
    training = iris.sample[rand.sample,]
    # Separate these into labels and variables
    # Training set
    training.labels = training$Species
    training.variables = training[,1:4]
    # Test set
    test.sample = setdiff(1:100,rand.sample)
    test = iris.sample[test.sample,]
    # split the variables and labels
    test.labels = test$Species
    test.variables = test[,1:4]

Now, run quadratic discriminant analysis on the training sample that you just created. Then, predict the labels of the test.variables. Use the following code:

    quad.disc = qda(training.variables,grouping = as.numeric(training.labels))
    test.predict = predict(quad.disc,test.variables)$class

Questions:

(a) Comment on the results of QDA on the iris dataset. Compare test.labels with test.predict. How many misclassifications were there on the test set?
(b) Would the My.Gaussian.LDA function that you created in Question (1) be appropriate for this dataset? Why or why not? If not, how could you adjust the function to be applicable to this dataset?
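Because sample( ) draws a different training set on every run, the split above, and with it the misclassification count in (a), can change between runs. A minimal sketch of a repeatable split follows, assuming a fixed seed; the value 123 is arbitrary and not part of the assignment, and the names two.species, train.idx and test.idx are hypothetical.

```r
set.seed(123)   # assumed seed; any fixed integer makes the split repeatable

# Keep the two species, as in the assignment (the column is Species, capital S)
two.species <- iris[iris$Species %in% c("setosa", "virginica"), ]

# Random half for training, remaining half for testing
train.idx <- sample(1:100, 50, replace = FALSE)
test.idx  <- setdiff(1:100, train.idx)

# The two index sets should be disjoint and together cover all 100 rows
length(intersect(train.idx, test.idx))
length(union(train.idx, test.idx))

# Class balance of the random training half (unused factor level dropped)
table(droplevels(two.species$Species[train.idx]))
```

Calling set.seed( ) with the same value before sample( ) reproduces the same training and test sets, which makes results easier to compare across runs.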
Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
A short course in Longitudinal Data Analysis ESRC Research Methods and Short Course Material for Practicals with the joiner package.
A short course in Longitudinal Data Analysis ESRC Research Methods and Short Course Material for Practicals with the joiner package. Lab 2 - June, 2008 1 jointdata objects To analyse longitudinal data
11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression
Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial
Linear Discriminant Analysis
Fiche TD avec le logiciel : course5 Linear Discriminant Analysis A.B. Dufour Contents 1 Fisher s iris dataset 2 2 The principle 5 2.1 Linking one variable and a factor.................. 5 2.2 Linking a
Data Mining with R. Decision Trees and Random Forests. Hugh Murrell
Data Mining with R Decision Trees and Random Forests Hugh Murrell reference books These slides are based on a book by Graham Williams: Data Mining with Rattle and R, The Art of Excavating Data for Knowledge
Viewing Ecological data using R graphics
Biostatistics Illustrations in Viewing Ecological data using R graphics A.B. Dufour & N. Pettorelli April 9, 2009 Presentation of the principal graphics dealing with discrete or continuous variables. Course
1. Classification problems
Neural and Evolutionary Computing. Lab 1: Classification problems Machine Learning test data repository Weka data mining platform Introduction Scilab 1. Classification problems The main aim of a classification
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
Getting Started with R and RStudio 1
Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following
3F3: Signal and Pattern Processing
3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani [email protected] Department of Engineering University of Cambridge Lent Term Classification We will represent data by
Lecture 9: Introduction to Pattern Analysis
Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns
Polynomial Neural Network Discovery Client User Guide
Polynomial Neural Network Discovery Client User Guide Version 1.3 Table of contents Table of contents...2 1. Introduction...3 1.1 Overview...3 1.2 PNN algorithm principles...3 1.3 Additional criteria...3
MACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Classification Problems
Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems
Programming Exercise 3: Multi-class Classification and Neural Networks
Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize
Classification by Pairwise Coupling
Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating
Analysis Tools and Libraries for BigData
+ Analysis Tools and Libraries for BigData Lecture 02 Abhijit Bendale + Office Hours 2 n Terry Boult (Waiting to Confirm) n Abhijit Bendale (Tue 2:45 to 4:45 pm). Best if you email me in advance, but I
JetBlue Airways Stock Price Analysis and Prediction
JetBlue Airways Stock Price Analysis and Prediction Team Member: Lulu Liu, Jiaojiao Liu DSO530 Final Project JETBLUE AIRWAYS STOCK PRICE ANALYSIS AND PREDICTION 1 Motivation Started in February 2000, JetBlue
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Package MDM. February 19, 2015
Type Package Title Multinomial Diversity Model Version 1.3 Date 2013-06-28 Package MDM February 19, 2015 Author Glenn De'ath ; Code for mdm was adapted from multinom in the nnet package
Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler
Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?
CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES
Proceedings of the 2 nd Workshop of the EARSeL SIG on Land Use and Land Cover CROP CLASSIFICATION WITH HYPERSPECTRAL DATA OF THE HYMAP SENSOR USING DIFFERENT FEATURE EXTRACTION TECHNIQUES Sebastian Mader
CS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
Principal Component Analysis
Principal Component Analysis ERS70D George Fernandez INTRODUCTION Analysis of multivariate data plays a key role in data analysis. Multivariate data consists of many different attributes or variables recorded
3. Data Analysis, Statistics, and Probability
3. Data Analysis, Statistics, and Probability Data and probability sense provides students with tools to understand information and uncertainty. Students ask questions and gather and use data to answer
M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest
Nathalie Villa-Vialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 3 - Random Forest This worksheet s aim is to learn how
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
Graphics in R. Biostatistics 615/815
Graphics in R Biostatistics 615/815 Last Lecture Introduction to R Programming Controlling Loops Defining your own functions Today Introduction to Graphics in R Examples of commonly used graphics functions
Linear Models for Classification
Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci
Fitting Subject-specific Curves to Grouped Longitudinal Data
Fitting Subject-specific Curves to Grouped Longitudinal Data Djeundje, Viani Heriot-Watt University, Department of Actuarial Mathematics & Statistics Edinburgh, EH14 4AS, UK E-mail: [email protected] Currie,
Using R for Linear Regression
Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional
Data Mining Lab 5: Introduction to Neural Networks
Data Mining Lab 5: Introduction to Neural Networks 1 Introduction In this lab we are going to have a look at some very basic neural networks on a new data set which relates various covariates about cheese
1 Topic. 2 Scilab. 2.1 What is Scilab?
1 Topic Data Mining with Scilab. I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical
DISCRIMINANT FUNCTION ANALYSIS (DA)
DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant
5 Correlation and Data Exploration
5 Correlation and Data Exploration Correlation In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
Mini-project in TSRT04: Cell Phone Coverage
Mini-project in TSRT04: Cell hone Coverage 19 August 2015 1 roblem Formulation According to the study Swedes and Internet 2013 (Stiftelsen för Internetinfrastruktur), 99% of all Swedes in the age 12-45
Machine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components
Eigenvalues, Eigenvectors, Matrix Factoring, and Principal Components The eigenvalues and eigenvectors of a square matrix play a key role in some important operations in statistics. In particular, they
Data exploration with Microsoft Excel: analysing more than one variable
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev
86 ITHEA APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING Anatoli Nachev Abstract: This paper presents a case study of data mining modeling techniques for direct marketing. It focuses to three
# load in the files containing the methyaltion data and the source # code containing the SSRPMM functions
################ EXAMPLE ANALYSES TO ILLUSTRATE SS-RPMM ######################## # load in the files containing the methyaltion data and the source # code containing the SSRPMM functions # Note, the SSRPMM
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
15.062 Data Mining: Algorithms and Applications Matrix Math Review
.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop
Java Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
R Commander Tutorial
R Commander Tutorial Introduction R is a powerful, freely available software package that allows analyzing and graphing data. However, for somebody who does not frequently use statistical software packages,
Exploratory Data Analysis and Plotting
Exploratory Data Analysis and Plotting The purpose of this handout is to introduce you to working with and manipulating data in R, as well as how you can begin to create figures from the ground up. 1 Importing
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University
Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision
Nuclear Science and Technology Division (94) Multigroup Cross Section and Cross Section Covariance Data Visualization with Javapeño
June 21, 2006 Summary Nuclear Science and Technology Division (94) Multigroup Cross Section and Cross Section Covariance Data Visualization with Javapeño Aaron M. Fleckenstein Oak Ridge Institute for Science
Cluster Analysis using R
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other
Analysis of Bayesian Dynamic Linear Models
Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
Lab 13: Logistic Regression
Lab 13: Logistic Regression Spam Emails Today we will be working with a corpus of emails received by a single gmail account over the first three months of 2012. Just like any other email address this account
CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not
Scatter Plots with Error Bars
Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each
Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
Package dsmodellingclient
Package dsmodellingclient Maintainer Author Version 4.1.0 License GPL-3 August 20, 2015 Title DataSHIELD client site functions for statistical modelling DataSHIELD
Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class
Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining
2013 MBA Jump Start Program. Statistics Module Part 3
2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
Advanced Statistical Methods in Insurance
Advanced Statistical Methods in Insurance 7. Multivariate Data All Pairwise Scattergrams Iris Data Set: 3 Species 50 Cases of each with p=4 measurements per case 2 Hudec & Schlögl 1 3-d Scatterplots iris[,
Biometric Authentication using Online Signatures
Biometric Authentication using Online Signatures Alisher Kholmatov and Berrin Yanikoglu [email protected], [email protected] http://fens.sabanciuniv.edu Sabanci University, Tuzla, Istanbul,
t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon
t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. [email protected] www.excelmasterseries.com
Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16
Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16 Since R is command line driven and the primary software of Chapter 16, this document details
Didacticiel - Études de cas
1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition.
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Analysis of Variance. MINITAB User s Guide 2 3-1
3 Analysis of Variance Analysis of Variance Overview, 3-2 One-Way Analysis of Variance, 3-5 Two-Way Analysis of Variance, 3-11 Analysis of Means, 3-13 Overview of Balanced ANOVA and GLM, 3-18 Balanced
MultiAlign Software. Windows GUI. Console Application. MultiAlign Software Website. Test Data
MultiAlign Software This documentation describes MultiAlign and its features. This serves as a quick guide for starting to use MultiAlign. MultiAlign comes in two forms: as a graphical user interface (GUI)
Review Jeopardy. Blue vs. Orange. Review Jeopardy
Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?
Exploratory data analysis for microarray data
Eploratory data analysis for microarray data Anja von Heydebreck Ma Planck Institute for Molecular Genetics, Dept. Computational Molecular Biology, Berlin, Germany [email protected] Visualization
Financial Econometrics MFE MATLAB Introduction. Kevin Sheppard University of Oxford
Financial Econometrics MFE MATLAB Introduction Kevin Sheppard University of Oxford October 21, 2013 2007-2013 Kevin Sheppard 2 Contents Introduction i 1 Getting Started 1 2 Basic Input and Operators 5
