CANONICAL CORRELATION ANALYSIS
V.K. Bhatia
I.A.S.R.I., Library Avenue, New Delhi

A canonical correlation is the correlation between two canonical (latent) variables, one representing a set of independent variables and the other a set of dependent variables. Each set may be regarded as a latent variable based on the measured indicator variables in that set. The latent variables are constructed so that the linear correlation between them is maximized. Whereas multiple regression is used for many-to-one relationships, canonical correlation is used for many-to-many relationships. There may be more than one such linear correlation relating the two sets of variables, with each correlation representing a different dimension along which the independent set is related to the dependent set. The purpose of canonical correlation is to explain the relation between the two sets of variables, not to model the individual variables.

By analogy with ordinary correlation, the squared canonical correlation is the percent of variance in the dependent canonical variable explained by the independent canonical variable along a given dimension (there may be more than one). In addition to asking how strong the relationship is between two latent variables, canonical correlation is useful in determining how many dimensions are needed to account for that relationship.

Canonical correlation analysis finds the linear combination of the variables in one set that has the largest correlation with a linear combination of the variables in the second set. This pair of linear combinations, or "root," is extracted and the process is repeated for the residual data, with the constraint that the second linear combination of variables must not correlate with the first one. The process is repeated until a successive linear combination is no longer significant.
Canonical correlation is a member of the multivariate general linear hypothesis (MGLH) family and shares many of the assumptions of multiple regression, such as linearity of relationships, homoscedasticity (same level of relationship for the full range of the data), interval or near-interval data, untruncated variables, proper specification of the model, lack of high multicollinearity, and multivariate normality for purposes of hypothesis testing.

Often in applied research, scientists encounter variables of large dimension and are faced with the problems of understanding dependency structures, reducing dimensionality, constructing a subset of good predictors from the explanatory variables, and so on. Canonical correlation analysis (CCA) provides a tool for attacking these problems. However, its appeal, and hence its motivation, seems to differ between theoretical statisticians and social scientists. We deal here with the various motivations for CCA mentioned above and with the related statistical inference procedures.

Dependency between Two Sets of Stochastic Variables

Let $X$ ($p \times 1$) be a random vector partitioned into two subvectors $X_1$ ($p_1 \times 1$) and $X_2$ ($p_2 \times 1$), with $p_1 \le p_2$ and $p_1 + p_2 = p$. Assume $E(X) = 0$. In order to study the dependency between $X_1$ and $X_2$, we seek the maximum possible correlation between any two arbitrary linear compounds $U = \alpha' X_1$ and $V = \gamma' X_2$, subject to the normalizations

$$\operatorname{Var}(U) = \alpha' \Sigma_{11} \alpha = 1 \quad \text{and} \quad \operatorname{Var}(V) = \gamma' \Sigma_{22} \gamma = 1,$$

where

$$\operatorname{Disp}(X) = \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$$

is partitioned according to the partition of $X$ above. It follows that this maximum correlation, say $\rho_1$, is the positive square root of the largest of the eigenroots $\rho_1^2 \ge \rho_2^2 \ge \dots \ge \rho_{p_1}^2$ of $\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ in the metric of $\Sigma_{11}$, i.e. of $\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1}$. The maximizing $\alpha$ and $\gamma$ are then given by $\alpha_1, \gamma_1$ such that $\alpha_1' \Sigma_{11} \alpha_1 = \gamma_1' \Sigma_{22} \gamma_1 = 1$ and

$$\begin{pmatrix} -\rho_1 \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & -\rho_1 \Sigma_{22} \end{pmatrix} \begin{pmatrix} \alpha_1 \\ \gamma_1 \end{pmatrix} = 0. \qquad (2.1)$$

Alternatively, $\alpha$ and $\gamma$ may be obtained as the eigenvector solutions, subject to the same normalizations, of

$$(\Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} - \rho^2 I)\,\alpha = 0, \qquad (\Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} - \rho^2 I)\,\gamma = 0. \qquad (2.2)$$

Further, it follows that

$$\alpha = \Sigma_{11}^{-1} \Sigma_{12}\, \gamma / \rho \quad \text{and} \quad \gamma = \Sigma_{22}^{-1} \Sigma_{21}\, \alpha / \rho, \qquad (2.3)$$

so that one needs to solve only one of the two equations in (2.2). $\rho_1$ is called the (first) canonical correlation between $X_1$ and $X_2$, and $(U_1, V_1) = (\alpha_1' X_1, \gamma_1' X_2)$ the pair of first canonical variates. If $\Sigma_{ii}$, $i = 1$ or $2$, happens to be singular, one can use a g-inverse $\Sigma_{ii}^{-}$ in place of $\Sigma_{ii}^{-1}$ above.

Note that:
$p_1 = p_2 = 1$: $\rho_1$ is the usual Pearson product-moment correlation coefficient between the scalar random variables $X_1$ and $X_2$;
$p_1 = 1$, $p_2 > 1$: $\rho_1$ is the multiple correlation coefficient between the scalar $X_1$ and the vector $X_2$.
Sample analogues are defined in the obvious way, with sample covariance matrices in place of $\Sigma_{ij}$.

Reduction of Dimensionality

In case $p_2$ or $p_1$ is large, it may become necessary to achieve a reduction of dimensionality without sacrificing much of the dependency between $X_1$ and $X_2$. We then seek further linear combinations $U_i = \alpha_i' X_1$, $V_i = \gamma_i' X_2$, $i = 1, 2, \dots, r+1$, such that $U_{r+1}$ and $V_{r+1}$ are maximally correlated among all linear combinations subject to having unit variances and to being uncorrelated with $U_1, V_1, \dots, U_r, V_r$. It turns out that $\operatorname{Corr}(U_{r+1}, V_{r+1}) = \rho_{r+1}$, and $\alpha_{r+1}, \gamma_{r+1}$ are simply solutions of (2.1) with $\rho_1$ replaced by $\rho_{r+1}$.
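The eigenvalue route in (2.2) and (2.3) can be sketched numerically as follows. This is a minimal illustration, not the author's procedure: it uses sample covariance matrices in place of the population $\Sigma_{ij}$, assumes $p_1 \le p_2$ and nonsingular $S_{11}$ and $S_{22}$, and the function name `canonical_correlations` and the simulated data are hypothetical.

```python
import numpy as np

def canonical_correlations(X1, X2):
    """Sample canonical correlations and weights via eqs. (2.2)-(2.3).

    X1: (n, p1) array, X2: (n, p2) array, with p1 <= p2 assumed.
    Returns rho (p1,), A (p1, p1) with columns alpha_i, G (p2, p1) with columns gamma_i.
    """
    X1 = X1 - X1.mean(axis=0)          # centre, so that E(X) = 0 holds in the sample
    X2 = X2 - X2.mean(axis=0)
    n = X1.shape[0]
    S11 = X1.T @ X1 / (n - 1)          # sample analogues of Sigma_11, Sigma_22, Sigma_12
    S22 = X2.T @ X2 / (n - 1)
    S12 = X1.T @ X2 / (n - 1)
    S21 = S12.T

    # M = S11^{-1} S12 S22^{-1} S21, the matrix in the first equation of (2.2)
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
    eigvals, eigvecs = np.linalg.eig(M)
    eigvals, eigvecs = eigvals.real, eigvecs.real   # eigenroots are real in theory

    order = np.argsort(eigvals)[::-1]               # rho_1^2 >= rho_2^2 >= ...
    rho = np.sqrt(np.clip(eigvals[order], 0.0, 1.0))
    A = eigvecs[:, order]

    # Normalise each column so that alpha' S11 alpha = 1
    A = A / np.sqrt(np.einsum("ij,jk,ki->i", A.T, S11, A))
    # Eq. (2.3): gamma = S22^{-1} S21 alpha / rho (requires rho > 0)
    G = np.linalg.solve(S22, S21 @ A) / rho
    return rho, A, G

# Hypothetical data: one shared latent factor links the two sets
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X1 = np.hstack([z + rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
X2 = np.hstack([z + rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
rho, A, G = canonical_correlations(X1, X2)
print(rho)
```

Only the first equation in (2.2) is solved; the $\gamma_i$ follow from (2.3), which also gives them unit variance automatically. In practice a QR- or SVD-based routine is usually preferred for numerical stability.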
When $\rho_{k+1}$ is judged to be insignificantly different from zero for some $k+1$, one may retain only the $2k$ variables $(U_i, V_i)$, $i = 1, 2, \dots, k$, for further analysis in place of the original $p = p_1 + p_2$ variables, presumably a much larger number. Note, however, that information on all $p_1 + p_2$ variables in $X_1$ and $X_2$ is still needed even to construct these $2k$ new variables.

Canonical Correlation in SPSS

o Canonical correlation is part of MANOVA in SPSS, in which one has to refer to one set of variables as "dependent" and the other as "covariates." It is available only in syntax. The command syntax is as follows, where set1 and set2 are variable lists:

    MANOVA set1 WITH set2
      /DISCRIM ALL ALPHA(1)
      /PRINT SIGNIF(MULTIV UNIV EIGEN DIMENR).

  Note that one cannot save canonical scores with this method.

o Canonical correlation has to be run in syntax, not from the SPSS menus. If you just want to create a dataset with canonical variables, SPSS supplies, as part of the Advanced Statistics module, the CANCORR macro located in the file Canonical correlation.sps, usually in the same directory as the SPSS main program. Open the syntax window with File, New, Syntax. Enter this:

    INCLUDE 'c:\program Files\SPSS\Canonical correlation.sps'.
    CANCORR SET1=varlist/ SET2=varlist/.

  where each "varlist" is one of the two lists of numeric variables. Output will be saved to a file called "cc_tmp2.sav," which will contain the canonical scores as new variables along with the original data. These scores will be labeled s1_cv1 and s2_cv1, s1_cv2 and s2_cv2, and the like, standing for the scores on the two canonical variables associated with each canonical correlation. The macro creates two canonical variables for each canonical correlation, and the number of canonical correlations equals the number of variables in the smaller of SET1 and SET2.

o OVERALS, which is part of the SPSS Categories module, computes nonlinear canonical correlation analysis on two or more sets of variables.

Some Comments on the Canonical Correlations

There could be a situation where some of the variables have high structure correlations even though their canonical weights are near zero.
This could happen because the weights are partial coefficients whereas the structure correlations (canonical factor loadings) are not: if a given variable shares variance with other independent variables entered in the linear combination of variables used to create a canonical variable, its canonical coefficient (weight) is computed based on the residual variance it can explain after controlling for these variables. If an independent variable is totally redundant with another independent variable, its partial coefficient (canonical weight) will be zero. Nonetheless, such a variable might have a high correlation with the canonical variable (that is, a high structure coefficient). In summary, the canonical weights have to do with the unique contributions of an original variable to the canonical variable, whereas the structure correlations have to do with the simple, overall correlation of the original variable with the canonical variable.

Canonical correlation is not a measure of the percent of variance explained in the original variables. The square of the structure correlation is the percent of the variance in a given original variable accounted for by a given canonical variable on a given (usually the first) canonical correlation. Note that the average percent of variance explained in the original variables by a canonical variable (the mean of the squared structure correlations for the canonical variable) is not at all the same as the canonical correlation, which has to do with the correlation between the weighted sums of the two sets of variables. Put another way, the canonical correlation does not tell us how much of the variance in the original variables is explained by the canonical variables. Instead, that is determined on the basis of the squares of the structure correlations.

Canonical coefficients can be used to explain with which original variables a canonical correlation is predominantly associated. The canonical coefficients are standardized coefficients and (like beta weights in regression) their magnitudes can be compared. Looking at the columns in SPSS output which list the canonical coefficients as columns and the variables in a set of variables as rows, some researchers simply note variables with the highest coefficients to determine which variables are associated with which canonical correlations and use this as the basis for inducing the meaning of the dimension represented by the canonical correlation.

However, Levine (1977) argues against the procedure above on the ground that the canonical coefficients may be subject to multicollinearity, leading to incorrect judgments. Also, because of suppression, a canonical coefficient may even have a different sign compared to the correlation of the original variable with the canonical variable. Therefore, instead, Levine recommends interpreting the relations of the original variables to a canonical variable in terms of the correlations of the original variables with the canonical variables - that is, by structure coefficients. This is now the standard approach.

Redundancy in Canonical Correlation Analysis

Redundancy is the percent of variance in one set of variables accounted for by the variate of the other set.
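The structure correlations and the redundancy just defined can be computed from sample data along the following lines. This is a hedged sketch under stated assumptions: the helper names `cca_scores` and `redundancy`, the set labels `X` (independent) and `Y` (dependent), and the simulated data are all hypothetical, and the canonical scores are obtained with a QR-plus-SVD route rather than with any particular package's procedure.

```python
import numpy as np

def cca_scores(X, Y):
    """Canonical correlations and unit-variance canonical variate scores (QR + SVD)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)                  # orthonormal basis for each centred set
    Qy, _ = np.linalg.qr(Yc)
    W, rho, Zt = np.linalg.svd(Qx.T @ Qy, full_matrices=False)
    scale = np.sqrt(X.shape[0] - 1)           # rescale so each variate has sample variance 1
    return rho, Qx @ W * scale, Qy @ Zt.T * scale

def redundancy(Y, v, u):
    """Structure correlations of Y with its own variate v, and redundancy given the opposite variate u."""
    loadings = np.array([np.corrcoef(Y[:, j], v)[0, 1] for j in range(Y.shape[1])])
    own = np.mean(loadings ** 2)              # mean squared structure correlation
    r2 = np.corrcoef(u, v)[0, 1] ** 2         # squared canonical correlation
    return loadings, own, r2, own * r2        # redundancy = own proportion x R-squared

# Hypothetical data: one shared latent factor links the two sets
rng = np.random.default_rng(1)
n = 300
z = rng.normal(size=(n, 1))
X = np.hstack([z + rng.normal(size=(n, 1)), rng.normal(size=(n, 2))])
Y = np.hstack([z + rng.normal(size=(n, 1)), rng.normal(size=(n, 1))])

rho, U, V = cca_scores(X, Y)
loadings, own, r2, red = redundancy(Y, V[:, 0], U[:, 0])
print("structure correlations:", loadings)
print("own-variate proportion:", own, "canonical R-squared:", r2, "redundancy:", red)
```

The product form makes the distinction concrete: a large squared canonical correlation gives a small redundancy whenever the dependent variate itself captures little of the variance in the original dependent variables.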
The researcher wants high redundancy, indicating that the independent variate accounts for a high percent of the variance in the dependent set of original variables. Note that this is not the canonical correlation squared, which is the percent of variance in the dependent variate accounted for by the independent variate. The redundancy analysis section of SAS output looks like that below, where rows 1 and 2 refer to the first and second canonical correlations extracted for these data. Italicized comments are not part of SAS output.

Canonical Redundancy Analysis
Raw variance tables are reported by SAS but are omitted here because redundancy is normally interpreted using the standardized tables.

Standardized Variance of the dependent variables Explained by
Their Own Canonical Variables (Proportion, Cumulative Proportion) | Canonical R-Squared | The Opposite Canonical Variables (Proportion, Cumulative Proportion)
[Numeric rows 1 and 2 of this table are missing from the source.]

The table above shows that, for the first canonical correlation, although the independent canonical variable explains 47.15% of the variance in the dependent canonical variable, the independent canonical variable is able to predict only 11.29% of the variance in the individual original dependent variables. Also, the dependent canonical variable predicts only 23.94% of the variance in the individual original dependent variables. Similar statements could be made about the second canonical correlation (row 2).

Canonical Redundancy Analysis
Standardized Variance of the independent variables Explained by
Their Own Canonical Variables (Proportion, Cumulative Proportion) | Canonical R-Squared | The Opposite Canonical Variables (Proportion, Cumulative Proportion)
[Numeric rows of this table are missing from the source.]

The table above repeats the first, except that the comparisons involve the independent variables.

Canonical Redundancy Analysis
Squared Multiple Correlations Between the dependent variables and the First 'M' Canonical Variables of the independent variables
Columns: M = 1, 2. Rows: the three dependent variables.
[Numeric entries of this table are missing from the source.]

In the table above, the columns represent the canonical correlations and the rows represent the original dependent variables, three in this case. The R-squareds are the percent of variance in each original dependent variable explained by the independent canonical variables. A similar table for the independent variables and the dependent canonical variables is also output by SAS but is not reproduced here.

Nonlinear Canonical Correlation (OVERALS)

Nonlinear canonical correlation analysis corresponds to categorical canonical correlation analysis with optimal scaling. The OVERALS procedure in SPSS (part of SPSS Categories) implements nonlinear canonical correlation. Independent variables can be nominal, ordinal, or interval, and there can be more than two sets of variables (more than one independent set and one dependent set).
Whereas ordinary canonical correlation maximizes correlations between the variable sets, in OVERALS the sets are compared to an unknown compromise set defined by the object scores. OVERALS uses optimal scaling, which quantifies categorical variables and then treats them as numerical variables, including applying nonlinear transformations to find the best-fitting model. For nominal variables, the order of the categories is not retained, but values are created for each category such that goodness of fit is maximized. For ordinal variables, order is retained and values maximizing fit are created. For interval variables, order is retained, as are equal distances between values.

Obtain OVERALS from the SPSS menu by selecting Analyze, Data Reduction, Optimal Scaling; select Multiple sets; select either Some variable(s) not multiple nominal or All variables multiple nominal; click Define; define at least two sets of variables; define the value range and measurement scale (optimal scaling level) for each selected variable.

SPSS output includes frequencies, centroids, iteration history, object scores, category quantifications, weights, component loadings, single and multiple fit, object scores plots, category coordinates plots, component loadings plots, category centroids plots, and transformation plots.

Tip: To minimize output, use the Automatic Recode facility on the Transform menu to create consecutive categories beginning with 1 for variables treated as nominal or ordinal. Likewise, for each variable scaled at the numerical (integer) level, subtract the smallest observed value from every value and add 1.

Warning: Optimal scaling recodes values on the fly to maximize goodness of fit for the given data. As with any atheoretical, post-hoc data mining procedure, there is a danger of overfitting the model to the given data. Therefore, it is particularly appropriate to employ cross-validation, developing the model on a training dataset and then assessing its generalizability by running the model on a separate validation dataset.

The SPSS manual notes, "If each set contains one variable, nonlinear canonical correlation analysis is equivalent to principal components analysis with optimal scaling. If each of these variables is multiple nominal, the analysis corresponds to homogeneity analysis.
If two sets of variables are involved and one of the sets contains only one variable, the analysis is identical to categorical regression with optimal scaling."

Reference

Levine, Mark S. (1977). Canonical Analysis and Factor Comparison. Thousand Oaks, CA: Sage Publications, Quantitative Applications in the Social Sciences Series, No. 6.
Dimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
More informationAdditional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
More informationDISCRIMINANT FUNCTION ANALYSIS (DA)
DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant
More informationPartial Least Squares (PLS) Regression.
Partial Least Squares (PLS) Regression. Hervé Abdi 1 The University of Texas at Dallas Introduction Pls regression is a recent technique that generalizes and combines features from principal component
More informationCanonical Correlation Analysis
Canonical Correlation Analysis LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the similarities and differences between multiple regression, factor analysis,
More informationModeration. Moderation
Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation
More informationIntroduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
More informationChapter Seven. Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS
Chapter Seven Multiple regression An introduction to multiple regression Performing a multiple regression on SPSS Section : An introduction to multiple regression WHAT IS MULTIPLE REGRESSION? Multiple
More informationReview Jeopardy. Blue vs. Orange. Review Jeopardy
Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More informationHow to Get More Value from Your Survey Data
Technical report How to Get More Value from Your Survey Data Discover four advanced analysis techniques that make survey research more effective Table of contents Introduction..............................................................2
More informationThis chapter will demonstrate how to perform multiple linear regression with IBM SPSS
CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the
More informationIntroduction to Principal Components and FactorAnalysis
Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a
More informationPrinciple Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression
Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationMultivariate Analysis of Variance (MANOVA)
Multivariate Analysis of Variance (MANOVA) Aaron French, Marcelo Macedo, John Poulsen, Tyler Waterson and Angela Yu Keywords: MANCOVA, special cases, assumptions, further reading, computations Introduction
More informationNCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
More informationRegression III: Advanced Methods
Lecture 16: Generalized Additive Models Regression III: Advanced Methods Bill Jacoby Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Goals of the Lecture Introduce Additive Models
More informationData analysis process
Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationJanuary 26, 2009 The Faculty Center for Teaching and Learning
THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationOverview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)
Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and
More informationMISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
More information6 Variables: PD MF MA K IAH SBS
options pageno=min nodate formdlim='-'; title 'Canonical Correlation, Journal of Interpersonal Violence, 10: 354-366.'; data SunitaPatel; infile 'C:\Users\Vati\Documents\StatData\Sunita.dat'; input Group
More informationMultivariate Analysis of Variance (MANOVA)
Chapter 415 Multivariate Analysis of Variance (MANOVA) Introduction Multivariate analysis of variance (MANOVA) is an extension of common analysis of variance (ANOVA). In ANOVA, differences among various
More informationCommon factor analysis
Common factor analysis This is what people generally mean when they say "factor analysis" This family of techniques uses an estimate of common variance among the original variables to generate the factor
More informationSAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
More informationData Analysis Tools. Tools for Summarizing Data
Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool
More informationEconometrics Simple Linear Regression
Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight
More informationInstructions for SPSS 21
1 Instructions for SPSS 21 1 Introduction... 2 1.1 Opening the SPSS program... 2 1.2 General... 2 2 Data inputting and processing... 2 2.1 Manual input and data processing... 2 2.2 Saving data... 3 2.3
More informationSPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011
SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 Statistical techniques to be covered Explore relationships among variables Correlation Regression/Multiple regression Logistic regression Factor analysis
More informationCanonical Correlation Analysis
Canonical Correlation Analysis Lecture 11 August 4, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #11-8/4/2011 Slide 1 of 39 Today s Lecture Canonical Correlation Analysis
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationIBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA
CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA Summer 2013, Version 2.0 Table of Contents Introduction...2 Downloading the
More informationRachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA
PROC FACTOR: How to Interpret the Output of a Real-World Example Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA ABSTRACT THE METHOD This paper summarizes a real-world example of a factor
More informationUnivariate Regression
Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is
More informationMultivariate Statistical Inference and Applications
Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim
More informationIBM SPSS Missing Values 22
IBM SPSS Missing Values 22 Note Before using this information and the product it supports, read the information in Notices on page 23. Product Information This edition applies to version 22, release 0,
More informationIntroduction to Matrix Algebra
Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary
More informationSimple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
More informationDirections for using SPSS
Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...
More informationAnalyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest
Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More informationCorrelational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots
Correlational Research Stephen E. Brock, Ph.D., NCSP California State University, Sacramento 1 Correlational Research A quantitative methodology used to determine whether, and to what degree, a relationship
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationSPSS Guide: Regression Analysis
SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationMultiple Regression: What Is It?
Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in
More informationData Analysis in SPSS. February 21, 2004. If you wish to cite the contents of this document, the APA reference for them would be
Data Analysis in SPSS Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Heather Claypool Department of Psychology Miami University
More informationDescriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize