Didacticiel - Études de cas

Save this PDF as:
Size: px
Start display at page:

Transcription

1 1 Topic Linear Discriminant Analysis Data Mining Tools Comparison (Tanagra, R, SAS and SPSS). Linear discriminant analysis is a popular method in domains of statistics, machine learning and pattern recognition. Indeed, it has interesting properties: it is rather fast on large bases; it can handle naturally multi-class problems (target attribute with more than 2 values); it generates a linear classifier linear, easy to interpret; it is robust and fairly stable, even applied on small databases; it has an embedded variable selection mechanism. Personally, I appreciate linear discriminant analysis because we can have multiple interpretations (probabilistic, geometric), and thus highlights various aspects of supervised learning. Discriminant analysis is both a descriptive and a predictive method 1. In the first case, we say Canonical Discriminant Analysis. We can consider the approach as a dimension reduction technique (a factor analysis). We want to highlight latent variables which explain the difference between the classes defined by the target attribute. In the second case, we can consider the approach as a supervised learning algorithm which intends to predict efficiently the class membership of individuals. Because we have a linear combination of the variables, we have a linear classifier. The purposes are therefore not intrinsically identical even if, when we analyze deeply the underlying formulas, we realize that the two approaches are closely related. Some bibliographic references maintain anyway the confusion by presenting them in a single framework. Tanagra differentiates clearly the two approaches by providing two separate components: LINEAR DISCRIMINANT ANALYSIS (SPV LEARNING tab) for the prediction approach; CANONICAL DISCRIMINANT ANALYSIS (FACTORIAL ANALYSIS tab) for the descriptive (factorial) approach. It is the same for SAS software with respectively DISCRIM and CANDISC procedures 2. Others combine them. This is the case for SPSS and R, mixing results which refer to different goals. For specialists who know how to distinguish important elements depending on the context, this amalgam is not a problem. For beginners, it is a bit more problematic. One can be disturbed by results which do not seem directly related to the purposes of the study. In this tutorial, we detail in a first time with the TANAGRA outputs about Predictive Linear Discriminant Analysis. In a second time, we compare them to the results of R, SAS and SPSS. The objective is to identify important information for predictive analysis i.e. get a simple classification system, get indications on the influence (for the interpretation) and the relevance of variables (statistical significance), and dispose of a variable selection mechanism. 2 Dataset We use the «alcohol.xls» data file. We want to predict the alcohol type (Kirsch, Mirabelle and Pear) from their composition (butanol, etc.; 6 descriptors). The sample contains 77 instances The distinction between the two approaches is clearly defined on the French version of Wikipedia ( novembre 2012 Page 1

2 3 Linear discriminant analysis under Tanagra 3.1 Importing dataset After we launch Tanagra, we create a new diagram by clicking on the FILE / NEW menu. We select the alcohol.xls data file. Tanagra shows the number of instances (77) and the number of variables (7, including the class attribute TYPE) which are loaded. 4 novembre 2012 Page 2

3 3.2 Linear discriminant analysis The DEFINE STATUS enables to specify the status of the variables in the analysis: TYPE is the target attribute (TARGET); the others are the descriptors (INPUT). We add the LINEAR DISCRIMINANT ANALYSIS (SPV LEARNING tab) into the diagram. We click on the contextual menu VIEW to obtain the results. 4 novembre 2012 Page 3

4 3.2.1 Confusion matrix and resubstitution error rate By applying the classifier to the learning sample, we obtain the confusion matrix: (9 + 6) = 15 instances are misclassified. The resubstitution error rate is 19.48% = 15 / 77. Because we use the same dataset for the learning and the evaluation of the classifier, it is well known that the resubstitution error rate is often optimistic Statistical evaluation of the classifier The Wilks' lambda statistic is used for the overall evaluation of the model. It evaluates the gap between the group centroids (the groups defined by the target attribute). When its value is close to 0, we expect that we get an efficient classifier. This point of view is closely related to the MANOVA 3. MANOVA Stat Value p-value Wilks' Lambda Bartlett -- C(12) Rao -- F(12, 138) In our case, = , this is a rather a good value (the worst value is 1). To evaluate the significance of the gap between the centroids, we use the Bartlett's C or the Rao's F transformations (C = , d.f. = 12; ² distribution; F = , d.f. 1 = 12 and d.f. 2 = 138; Fisher distribution). At the 5% level, we reject the null hypothesis: the group centroids are significantly different. By coupling this statistical test with the analysis of the confusion matrix, we understand that the good behavior of the model relies primarily on the situation of KIRSCH which we can detect perfectly. The descriptive analysis will confirm this result Classification functions The following table gives the classification functions (Figure 1). They are used when we want classifying instances (KIRSCH, MIRAB or POIRE) novembre 2012 Page 4

5 Classification functions Attribute KIRSCH MIRAB POIRE MEOH ACET BU MEPR ACAL LNPRO constant Figure 1 Classification functions - Model without variable selection Let us consider a new unlabeled instance to classify. MEOH ACET BU1 MEPR ACAL LNPRO We apply the classification function for each class value. For instance, for KIRSCH we have: S(, KIRSCH) = x x x x x x = We apply the function for each class, we obtain: KIRSCH MIRAB POIRE S(.) MIRAB has the highest score [S(MIRAB) = ]. We assign the value MIRAB to this instance. This classification operation is one of the main goals of the predictive analytics process Assessing the influence of the variables in the model The influence of the predictive variables is not the same. The discriminant analysis can measure their contribution in the model. Tanagra shows these effects in the Statistical Evaluation part of the coefficients table (green). Classification functions Attribute KIRSCH MIRAB POIRE Wilks L. Partial L. F(2,69) p-value M EOH ACET BU M EPR ACAL LNPRO constant Statistical Evaluation Remember that the overall lambda of the model is = (Section 3.2.2). The first column ( Wilks L. ) indicates the lambda of the model if we remove the variable. For instance, if we remove MEOH from the classifier, the new value of the lambda for the model containing all the predictive variables except MEOH is {-MEOH} = The higher is the difference with the initial value of lambda, the most significant is the variable. Partial lambda is the ratio between these values. For instance, Partial L. {-MEOH} = / = novembre 2012 Page 5

6 The last two columns are dedicated to the checking of the contribution of each variable. They are based on the comparison of the lambda with and without the variable that we intend to evaluate. For our dataset, only ACAL is not significant at the 5% level (p-value {-ACAL} = > 5%). 3.3 Variable selection We can select the relevant variables by using a stepwise approach. For this, we use the STEPDISC (FEATURE SELECTION tab) component. Tanagra can implement forward and backward strategies. We choose the forward strategy for our dataset. We start with the null model. We add sequentially the most significant variable as long as it is significant at the 1% level. 4 novembre 2012 Page 6

7 Because we add STEPDISC after DEFINE STATUS 1 into the diagram, it searches the relevant variables among those defined as INPUT into the preceding component. We click on the VIEW menu to obtain the results, 3 variables are selected: BU1, MEPR, MEOH Tanagra provides the details of the search processing. We will describe them later (cf. SAS outputs). All we need to do is to add again the LINEAR DISCRIMINANT ANALYSIS component after STEPDISC into the diagram. Tanagra performs the learning process on the selected variables only. The resubstitution error rate is 25.97%. It seems worse than the model with all the predictive attributes (19.48%). But we know that the resubstitution error rate is not a good indicator of the performance of the models. It often favors the complex model incorporating a large number of predictive variables. We must use resampling approach (e.g. cross validation, bootstrap) to obtain a reliable evaluation of the error rate enabling to compare the models. Here are the coefficients of the new classification functions. Classification functions Attribute KIRSCH MIRAB POIRE Wilks L. Partial L. F(2,72) p-value BU MEPR MEOH constant Statistical Evaluation Figure 2 Classification functions Model after variable selection We apply the new model on the instance to classify (section 3.2.3, the unused variables are grayed): MEOH ACET BU1 MEPR ACAL LNPRO We obtain the following scores. Here also, we assign the instance to the MIRAB class. KIRSCH MIRAB POIRE S(.) novembre 2012 Page 7

8 3.4 Canonical discriminant analysis (factorial discriminant analysis) Descriptive discriminant analysis is not the main topic of this tutorial. But we present nevertheless the outputs of Tanagra to better understand the results provided of the other tools. We add the CANONICAL DISCRIMINANT ANALYSIS (FACTORIAL ANALYSIS tab) component into the diagram. Then we click on the VIEW menu to obtain the results Eigenvalues table This table provides the eigenvalues for each factor. We have also the proportion of explained variance. The significance test of each factor is provided. Roots and Wilks' Lambda Root Eigenvalue Proportion Canonical R Wilks Lambda CHI-2 d.f. p-value Canonical coefficients The raw canonical coefficients enable to calculate the coordinate of the individuals on the factors. The unstandardized coefficients can be applied on the untransformed values of the variables. Canonical Discriminant Function Coefficients Unstandardized Attribute Root n1 Root n2 Root n1 Root n2 MEOH ACET BU MEPR ACAL LNPRO constant For the instance with the following description: Figure 3 Raw canonical coefficients Standardized 4 novembre 2012 Page 8 -

9 Canonical component 2 Didacticiel - Études de cas MEOH ACET BU1 MEPR ACAL LNPRO The coordinate on the first factor is, Axe 1 = x x x x x x = On the second factor, Axe 2 = x x = Canonical components scatter plot KIRSCH MIRAB POIRE Canonical component 1 By applying these functions on the individuals of the learning sample, we obtain a representation of the instances into the two first axes. By coloring the points according to its group membership, we can evaluate the class separability. We note that KIRSCH does not overlap of the others. This confirms the results of the confusion matrix obtained earlier (section 3.2.1) Canonical structure The canonical structure corresponds to the correlation between the variables and the factors. There are different ways to compute this correlation: ignoring the class membership (TOTAL); controlling the class membership (WITHIN); highlighting the class membership (BETWEEN). Factor Structure Matrix - Correlations Root Root n1 Root n2 Descriptors Total Within Between Total Within Between M EOH ACET BU M EPR ACAL LNPRO At a first glance, we observe that: (1) the distinction between KIRSCH and the others relies mainly on MEOH and BU1 on the first axis; (2) the distinction between MIRAB and POIRE on the second axis is due to ACET and MEPR. 4 novembre 2012 Page 9

10 Canonical component 2 Didacticiel - Études de cas These results are consistent with the predictive approach where BU1, MEPR and MEOH were the selected variables after the STEPDISC FORWARD process at 1% level Class means on canonical variables Tanagra provides the means on the axis of the three groups. Group centroids on the canonical variables TYPE Root n1 Root n2 KIRSCH M IRAB POIRE Sq Canonical corr It is especially interesting to visualize the class centroids within the graphical representation of the individuals. We observe that the closest group centroid to the instance to classify with the coordinates ( , ) is MIRAB ( , ). Canonical components scatter plot Group centroids 4 3 KIRSCH MIRAB POIRE 2 MIRAB 1 KIRSCH POIRE Canonical component Classification rule Method 1 How to assign the individual with the coordinates ( , ) to one of the classes? Visually, we observe that the MIRAB centroid ( , ) is the closest to the instance to classify (with the coordinates: , ). But we must confirm this visual impression with calculations. To obtain a classification consistent with the one of the predictive discriminant analysis, we need in addition to the group centroids - to know the proportion of the groups into the learning sample. TYPE Proportion KIRSCH MIRAB POIRE novembre 2012 Page 10

11 For the instance that we want to classify, we calculate the squared generalized distance to the centroids: Where c is one the classes, K is the number of factors, f k ( ) is the coordinate of the instance on the factor k, kc is the conditional mean of the class c on the factor k, π c is the proportion of the class c in the learning sample. Thus, for the various classes, we obtain: D²(, KIRSCH) = ( )² + ( )² - 2 ln( ) = D²(, MIRAB) = ( )² + ( )² - 2 ln( ) = D²(, POIRE) = ( )² + ( )² - 2 ln( ) = We assign the individual to the class for which the centroid is the closest. In this case, this is MIRAB since D²(, MIRAB) takes the smallest value. Furthermore, we can calculate the posterior probability of the class membership: For the instance above: P(Y( ) = KIRSCH / X) = / = P(Y( ) = MIRAB / X) = / = P(Y( ) = POIRE / X) = / = We assign the instance to the most likely group Classification rule - Method 2 By developing the formulas above and by multiplying them by -0.5, we obtain a linear classification functions based on the factorial coordinates. They are equivalent to the classification function provided by the predictive discriminant analysis, to within a constant which does not depend to the class membership. So, the classification characteristic is exactly the same. We have: For the instance to classify: S (, KIRSCH) = x x ( ) ( ² + ( )²)/2 + ln( ) = S (, MIRAB) = x ( ) x (( )² ²)/2 + ln( ) = S (, POIRE) = x ( ) x ( ) (( )² + ( ) ²)/2 + ln( ) = We assign the instance to the MIRAB class. The result is necessarily consistent with that of the predictive discriminant analysis (sections and 3.4.5). 4 novembre 2012 Page 11

12 3.4.7 Classification rule - Return on the original variables - Method 3 In the previous section, the classification functions are a linear combination of the factors. These last ones are a linear combination of the original variables. So, we can produce a classification function defined on the original predictive variables. The classification functions on factors are the following: Score function 1 KIRSCH MIRAB POIRE f f Const For instance, we have for KIRSH: S (KIRSH) = x f x f The factors F1 are F2 are linear combinations of original predictive attributes (Figure 3). We can thus produce a new version of the classification functions defined on the predictive attributes S (.): Score function 2 KIRSCH MIRAB POIRE MEOH ACET BU MEPR ACAL LNPRO constant We can compare them to the classification functions obtained from the predictive linear discriminant analysis S(.) (Figure 1). We observe that the coefficients of S() and S'() are different, but the gap between the classes is the same for each variable. S() - Analyse prédictive S'() - Déduite de l'analyse descriptive Ecarts entre coefficients Attribute KIRSCH MIRAB POIRE KIRSCH MIRAB POIRE KIRSCH MIRAB POIRE MEOH ACET BU MEPR ACAL LNPRO constant This is the reason for which we have not the same score value for the individual S(, KIRSCH) S (, KIRSCH) ; S(, MIRAB) S (, MIRAB) ; S(, POIRE) S (, POIRE) But the difference depends on the individual and not to the class membership. [S(, KIRSCH) - S (, KIRSCH)] = [S(, MIRAB) - S (, MIRAB)] = [S(, POIRE) - S (, POIRE)] In the end, the instances are classified in identical way. This is the most important. 4 Discriminant analysis under SAS 4.1 Proc DISCRIM We perform the same analysis under SAS 9.3 using PROC DISCRIM. We use the following commands. 4 novembre 2012 Page 12

13 proc discrim data = alcohol; run; class type; var MEOH ACET BU1 MEPR ACAL LNPRO1; priors proportional; The PRIORS option enables to use the proportion measured on the learning set as classes prior distribution in the learning process. METHOD = NORMAL (multivariate normal distribution) and POOL = YES (used the pooled covariance matrix) are two important options for which the default values were used. Overall description. SAS provides a description of the problem that we handle. We observe, among others, the proportion of classes into the learning sample. Distances between the group centroids. We have the squared generalized distance between the group centroids. The distance between the centroids of the same group is not null. In addition it is not symmetric! 4 novembre 2012 Page 13

14 This is really surprising. This is because the proportion of the groups is used in the formula 4. For instance, the distance of the KIRSCH with itself is obtained with (see section for the values of the centroids): D²(KIRSCH, KIRSCH) = ( )² + ( )² 2 x ln( ) = The generalized distance is not symmetric. For instance, between KIRSCH and MIRAB: D²(KIRSCH, MIRAB) = ( )² + ( )² - 2 x ln( ) = D²(MIRAB, KIRSCH) = ( )² + ( )² -2 x ln( ) = Classification functions. SAS provides the same classification functions as Tanagra. But SAS does not provide any information about the relevance of the variables. Confusion matrix and resubstitution error rate. Last, SAS provides the confusion matrix novembre 2012 Page 14

15 TANAGRA SAS Didacticiel - Études de cas 4.2 Variable selection with STEPDISC SAS proposes STEPDISC, a tool for the variable selection which is consistent with the linear discriminant analysis principle. We perform a forward selection at 1% level (METHOD = FORWARD, SLENTRY = 1%): proc stepdisc data = alcohol method = forward slentry = 0.01; class type; var MEOH ACET BU1 MEPR ACAL LNPRO1; run; We compare the detailed results with those of Tanagra (Figure 4). Figure 4 Detailed results - Stepdisc at 1% level - Tanagra Step 1: SAS picks BU1 after having listed the contributions of the candidate variables. 4 novembre 2012 Page 15

16 TANAGRA SAS Didacticiel - Études de cas SAS provides R-Square = 1 -. For instance, R-Square(MEOH) = = The F statistic is the same as Tanagra, it enables to check the significance of the variable e.g. F(MEOH) = 64.82, with the p-value(meoh) = The tolerance statistic characterizes the redundancy with the variables already included in the model. At the first step, the initial set of selected variables is empty. There is no possible redundancy. It is therefore equal to 1 for all variables. BU1 has the highest "F Value", and it is significant at the 1% level. It is thus included in the model. In the forward process, this decision is unchangeable. We cannot remove this variable later. SAS pursues with the assessment of the current model The stepdisc documentation explains the reading of the various statistics (Pillai, Average Squared Canonical Correlation) provided by SAS. There is one variable at this step. Thus, the F Value is the same as the one of the BU1 variable in the table Stepdisc Procedure Step 1 above. Step 2: SAS evaluates the contribution of the remaining variables. The "Partial R-Square" compares the lambda of the models containing or not an additional variable e.g. for MEOH, Partial R- Square(MEOH) = / MEPR is the best variable according the Partial R-Square (or according to the F Value). It is significant at the 1% level. It is approved. We note also, even if this is not taken into account for the selection, that MEPR is weakly correlated with the already introduced variable because its tolerance is It means that the squared correlation between BU1 and MEPR is r² = = For the overall evaluation of the model with 2 (BU1, MEPR) predictive variables, we have: 4 novembre 2012 Page 16

17 SAS continues until it is no more possible to add variables. It then produces a table summarizing the process (Figure 5). The values (Wilks lambda, F Value) are consistent with those of Tanagra. SAS TANAGRA Figure 5 Summary of the selection process - Stepdisc forward at 1% level The selected variables are: BU1, MEPR and MEOH. 5 Discriminant Analysis under R lda() [MASS package] We use the lda() procedure of the «MASS» package to perform the linear discriminant analysis. This package is automatically installed with R. 5.1 Importing the dataset The read.xlsx() command of the package xlsx 5 enables to read a data file in the Excel format (XLS or XLSX). The summary() command describes shortly the variables of the dataset. library(xlsx) #sheetindex: number of the sheet to read #header: the first row corresponds to the name of the variables alcohol.data <- read.xlsx(file="alcohol.xls",sheetindex=1,header=t) print(summary(alcohol.data)) We obtain the main features of the variables novembre 2012 Page 17

18 TYPE is the categorical attribute (3 values) that we want to explain. 5.2 Using the lda() procedure We load the MASS 6 package. We start the lda() procedure on the learning sample. #linear discriminant analysis library(mass) alcohol.lda <- lda(type ~., data = alcohol.data) print(alcohol.lda) lda() provides: the proportions of the classes; the conditional mean of the variables according to the classes; the unstandardized coefficients of the canonical functions (Section 3.4.2), without the constant term; the proportion of variance explained by each factor novembre 2012 Page 18

19 In principle, lda() seems only intended to the descriptive analysis. But as we have seen above, the connection between the descriptive analysis and the predictive analysis is strong (section 3.4.5). Thus, the predict() command enables to assign individual to the classes. The classification properties are identical to those of Tanagra or SAS i.e. each individual is assigned to the same group whatever the tool used. 5.3 Prediction The predict() enables to apply the classifier on the individuals. #prediction on the training set pred.lda <- predict(alcohol.lda,newdata=alcohol.data) print(attributes(pred.lda)) The object has 3 main properties: class, posterior and x. class is a vector, its length is n = 77. It provides the predicted value for each individual of the sample. For instance, we show here the predicted values for the first 6 individuals. posterior is a matrix with n = 77 rows and 3 columns (because the class attribute has 3 values). Each column corresponds to the posterior probability of the class membership. For the first 6 instances, we have the following values: x provides the canonical coordinates of the instances. This is a matrix with 77 rows and 2 columns (because we have two factors). Here are the values for the first 6 instances: 4 novembre 2012 Page 19

20 5.4 Confusion matrix To obtain the confusion matrix, we create a contingency table from the observed values of the class attribute and the predictions of the model. #confusion matrix mc.lda <- table(alcohol.data\$type,pred.lda\$class) print(mc.lda) We have the same matrix as Tanagra (section 3.2.1) and SAS (section 4.1). We can compute the error rate, #error rate print(1-sum(diag(mc.lda))/sum(mc.lda)) We have 19.48% 5.5 Variable selection We use the proceduree greedy.wilks() of the package «klar» 7 for the variable selection. #variable selection library(klar) alcohol.forward <- greedy.wilks(type ~.,data=alcohol.data, niveau = 0.01) novembre 2012 Page 20

21 print(alcohol.forward) At each step, R provides the F statistic for the additional variable (F.statistics.diff), and the F statistic for the model with the current set of selected variables (F.statistics.overall). The process is consistent with those of Tanagra and SAS (Figure 5). We perform a new analysis on the variables selected by the greedy.wilks() procedure. This last one provides directly the formula with only the relevant variables. This feature is really useful if the initial number of candidate variables is very large. #2nd model after variable selection alcohol.lda.fwd <- lda(alcohol.forward\$formula, data = alcohol.data) print(alcohol.lda.fwd) We obtain a new version of the model. We apply this model on the sample to obtain the corresponding confusion matrix and error rate. 4 novembre 2012 Page 21

22 #2nd confusion matrix mc.lda.fwd <- table(alcohol.data\$type,predict(alcohol.lda.fwd,newdata=alcohol.data)\$class) print(mc.lda.fwd) #2nd error rate print(1-sum(diag(mc.lda.fwd))/sum(mc.lda.fwd)) We obtain: 6 Discriminant analysis under SPSS We use the French version of SPSS We import the alcohol.xls data file. We transform TYPE in a numerical attribute before (KIRSCH = 1, MIRAB = 2, POIRE = 3) Construction of the model including all the predictive variables We click on the ANALYSE / CLASSIFICATION / ANALYSE DISCRIMINANTE menu. We set into the settings dialog: 8 See also Annotated SPSS Output Discriminant Analysis. 4 novembre 2012 Page 22

23 We set TYPE in «Critère de regroupement», the range (Définir intervalle) of the value is MIN = 1 to MAX = 3. All the others variables are predictive variables «Variables explicatives». At the moment, we do not perform a variable selection «Entrer les variables simultanément». Then, we click on the «Statistiques» button. We ask the Fisher classification function. From the initial dialog box, we click on the «Classements» button. We want to estimate the prior probabilities of the groups from the proportion measured on the learning sample. We use the pooled covariance matrix. We validate these settings. A new window describing the results of the calculations appears. SPSS mixes the outputs of the canonical analysis and predictive analysis. This is not a problem. Indeed, the two approaches can converge as we shown previously. We must know simply discern the appropriate information in the outputs. 4 novembre 2012 Page 23

24 Firstly, we have the results in the descriptive point of view. We have the eigenvalues related to the factors and the tests of significance. Then, we have: (1) the canonical functions; (2) the canonical structure; (3) the conditional centroids. (1) (2) (3) 4 novembre 2012 Page 24

25 Secondly, we have the results in the predictive point of view. We have, among others, the classification functions: the Fisher s Linear Discriminant Functions according to SPSS. 6.2 Variable selection To perform a variable selection, we must restart the analysis by modifying the treatment of the independent variables. We select the stepwise approach. We click on the Méthode button. Only the bidirectional approach is available i.e. at each step, we check if the adding of a variable does not imply the removing of another already selected variable. Two significance levels enable to guide the process: 0.01 for the adding, 0.05 for the removing. We select the Wilks lambda (Méthode) to obtain consistent results with the other tools. 4 novembre 2012 Page 25

26 A table summarizes the process. It is very similar to the table provided by the procedure greedy.wilks() (KlaR package) under R. Then we have more details on the various versions of the models. As we seen above, the tolerance criterion enables to evaluate the redundancy between the selected variables. It varies between 0 and 1. The higher is the value, the weaker is the redundancy between the variables. In the last column, we have the Wilks' lambda of the model if we remove a variable. Last, a table provides a full detailed description of the process. We observe for instance that BU1 is the best variable in the first step. It proposes the weaker value of the Wilks' lambda (0.299). In the second step, MEPR is the most interesting (0.249). Etc. 4 novembre 2012 Page 26

27 7 Conclusion Discriminant analysis is an attractive method. It is available in almost all statistical software. In this tutorial, we tried to highlight the similarities and the differences between the outputs of Tanagra, R, SAS, and SPSS software. The main conclusion is that, if the presentation is not always the same, ultimately we have exactly the same results. This is the most important. 4 novembre 2012 Page 27

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

Tutorial Segmentation and Classification

MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION 1.0.8 Tutorial Segmentation and Classification Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel

Didacticiel - Études de cas

1 Topic Regression analysis with LazStats (OpenStat). LazStat 1 is a statistical software which is developed by Bill Miller, the father of OpenStat, a wellknow tool by statisticians since many years. These

Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

DISCRIMINANT FUNCTION ANALYSIS (DA)

DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant

Didacticiel Études de cas

1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.

Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear. In the main dialog box, input the dependent variable and several predictors.

Multivariate Analysis of Variance (MANOVA)

Chapter 415 Multivariate Analysis of Variance (MANOVA) Introduction Multivariate analysis of variance (MANOVA) is an extension of common analysis of variance (ANOVA). In ANOVA, differences among various

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

Multivariate Analysis of Variance (MANOVA): I. Theory

Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

1 Topic. 2 Scilab. 2.1 What is Scilab?

1 Topic Data Mining with Scilab. I know the name "Scilab" for a long time (http://www.scilab.org/en). For me, it is a tool for numerical analysis. It seemed not interesting in the context of the statistical

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Data Mining and Data Warehousing Henryk Maciejewski Data Mining Predictive modelling: regression Algorithms for Predictive Modelling Contents Regression Classification Auxiliary topics: Estimation of prediction

Moderation. Moderation

Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation

2 Decision tree + Cross-validation with R (package rpart)

1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing

Introduction to Principal Components and FactorAnalysis

Introduction to Principal Components and FactorAnalysis Multivariate Analysis often starts out with data involving a substantial number of correlated variables. Principal Component Analysis (PCA) is a

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP ABSTRACT In data mining modelling, data preparation

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

Association Between Variables

Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

Overview of Factor Analysis

Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

Figure 1. An embedded chart on a worksheet.

8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Calculate counts, means, and standard deviations Produce

Non-Technical Components of Software Development and their Influence on Success and Failure of Software Development

Non-Technical Components of Software Development and their Influence on Success and Failure of Software Development S.N.Geethalakshmi Reader, Department of Computer Science, Avinashilingam University for

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the

COC131 Data Mining - Clustering

COC131 Data Mining - Clustering Martin D. Sykora m.d.sykora@lboro.ac.uk Tutorial 05, Friday 20th March 2009 1. Fire up Weka (Waikako Environment for Knowledge Analysis) software, launch the explorer window

SIMPLE LINEAR CORRELATION. r can range from -1 to 1, and is independent of units of measurement. Correlation can be done on two dependent variables.

SIMPLE LINEAR CORRELATION Simple linear correlation is a measure of the degree to which two variables vary together, or a measure of the intensity of the association between two variables. Correlation

Scatter Plots with Error Bars

Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each

Multivariate Analysis of Variance (MANOVA)

Multivariate Analysis of Variance (MANOVA) Aaron French, Marcelo Macedo, John Poulsen, Tyler Waterson and Angela Yu Keywords: MANCOVA, special cases, assumptions, further reading, computations Introduction

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

Multivariate Statistical Inference and Applications

Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

Dimensionality Reduction: Principal Components Analysis

Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely

Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

Tutorial on Exploratory Data Analysis

Tutorial on Exploratory Data Analysis Julie Josse, François Husson, Sébastien Lê julie.josse at agrocampus-ouest.fr francois.husson at agrocampus-ouest.fr Applied Mathematics Department, Agrocampus Ouest

In this tutorial, we try to build a roc curve from a logistic regression.

Subject In this tutorial, we try to build a roc curve from a logistic regression. Regardless the software we used, even for commercial software, we have to prepare the following steps when we want build

Multiple Linear Regression

Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

ANALYTIC HIERARCHY PROCESS (AHP) TUTORIAL

Kardi Teknomo ANALYTIC HIERARCHY PROCESS (AHP) TUTORIAL Revoledu.com Table of Contents Analytic Hierarchy Process (AHP) Tutorial... 1 Multi Criteria Decision Making... 1 Cross Tabulation... 2 Evaluation

Data representation and analysis in Excel

Page 1 Data representation and analysis in Excel Let s Get Started! This course will teach you how to analyze data and make charts in Excel so that the data may be represented in a visual way that reflects

Two Correlated Proportions (McNemar Test)

Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with

Factor Analysis. Chapter 420. Introduction

Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

Using Excel for descriptive statistics

FACT SHEET Using Excel for descriptive statistics Introduction Biologists no longer routinely plot graphs by hand or rely on calculators to carry out difficult and tedious statistical calculations. These

Two-Sample T-Tests Assuming Equal Variance (Enter Means)

Chapter 4 Two-Sample T-Tests Assuming Equal Variance (Enter Means) Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample t-tests when the variances of

Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

CLUSTER ANALYSIS. Kingdom Phylum Subphylum Class Order Family Genus Species. In economics, cluster analysis can be used for data mining.

CLUSTER ANALYSIS Introduction Cluster analysis is a technique for grouping individuals or objects hierarchically into unknown groups suggested by the data. Cluster analysis can be considered an alternative

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

Below is a very brief tutorial on the basic capabilities of Excel. Refer to the Excel help files for more information.

Excel Tutorial Below is a very brief tutorial on the basic capabilities of Excel. Refer to the Excel help files for more information. Working with Data Entering and Formatting Data Before entering data

Tutorial: Get Running with Amos Graphics

Tutorial: Get Running with Amos Graphics Purpose Remember your first statistics class when you sweated through memorizing formulas and laboriously calculating answers with pencil and paper? The professor

Using Excel in Research. Hui Bian Office for Faculty Excellence

Using Excel in Research Hui Bian Office for Faculty Excellence Data entry in Excel Directly type information into the cells Enter data using Form Command: File > Options 2 Data entry in Excel Tool bar:

A Property & Casualty Insurance Predictive Modeling Process in SAS

Paper AA-02-2015 A Property & Casualty Insurance Predictive Modeling Process in SAS 1.0 ABSTRACT Mei Najim, Sedgwick Claim Management Services, Chicago, Illinois Predictive analytics has been developing

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

6 Variables: PD MF MA K IAH SBS

options pageno=min nodate formdlim='-'; title 'Canonical Correlation, Journal of Interpersonal Violence, 10: 354-366.'; data SunitaPatel; infile 'C:\Users\Vati\Documents\StatData\Sunita.dat'; input Group

Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

Factor Analysis. Principal components factor analysis. Use of extracted factors in multivariate dependency models

Factor Analysis Principal components factor analysis Use of extracted factors in multivariate dependency models 2 KEY CONCEPTS ***** Factor Analysis Interdependency technique Assumptions of factor analysis

Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA

PROC FACTOR: How to Interpret the Output of a Real-World Example Rachel J. Goldberg, Guideline Research/Atlanta, Inc., Duluth, GA ABSTRACT THE METHOD This paper summarizes a real-world example of a factor

Leveraging Ensemble Models in SAS Enterprise Miner

ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

Tutorial: Get Running with Amos Graphics

Tutorial: Get Running with Amos Graphics Purpose Remember your first statistics class when you sweated through memorizing formulas and laboriously calculating answers with pencil and paper? The professor

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics)

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:

Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:

Directions for using SPSS

Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...

An introduction to using Microsoft Excel for quantitative data analysis

Contents An introduction to using Microsoft Excel for quantitative data analysis 1 Introduction... 1 2 Why use Excel?... 2 3 Quantitative data analysis tools in Excel... 3 4 Entering your data... 6 5 Preparing

Appendix 2.1 Tabular and Graphical Methods Using Excel

Appendix 2.1 Tabular and Graphical Methods Using Excel 1 Appendix 2.1 Tabular and Graphical Methods Using Excel The instructions in this section begin by describing the entry of data into an Excel spreadsheet.

Multivariate Analysis

Table Of Contents Multivariate Analysis... 1 Overview... 1 Principal Components... 2 Factor Analysis... 5 Cluster Observations... 12 Cluster Variables... 17 Cluster K-Means... 20 Discriminant Analysis...

Profile analysis is the multivariate equivalent of repeated measures or mixed ANOVA. Profile analysis is most commonly used in two cases:

Profile Analysis Introduction Profile analysis is the multivariate equivalent of repeated measures or mixed ANOVA. Profile analysis is most commonly used in two cases: ) Comparing the same dependent variables

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

Frank C Porter and Ilya Narsky: Statistical Analysis Techniques in Particle Physics Chap. c11 2013/9/9 page 221 le-tex 221 11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

Two-Sample T-Tests Allowing Unequal Variance (Enter Difference)

Chapter 45 Two-Sample T-Tests Allowing Unequal Variance (Enter Difference) Introduction This procedure provides sample size and power calculations for one- or two-sided two-sample t-tests when no assumption

A Brief Introduction to SPSS Factor Analysis

A Brief Introduction to SPSS Factor Analysis SPSS has a procedure that conducts exploratory factor analysis. Before launching into a step by step example of how to use this procedure, it is recommended

FACTOR ANALYSIS NASC

FACTOR ANALYSIS NASC Factor Analysis A data reduction technique designed to represent a wide range of attributes on a smaller number of dimensions. Aim is to identify groups of variables which are relatively

COMPUTING AND VISUALIZING LOG-LINEAR ANALYSIS INTERACTIVELY

COMPUTING AND VISUALIZING LOG-LINEAR ANALYSIS INTERACTIVELY Pedro Valero-Mora Forrest W. Young INTRODUCTION Log-linear models provide a method for analyzing associations between two or more categorical

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

5.2 Customers Types for Grocery Shopping Scenario

------------------------------------------------------------------------------------------------------- CHAPTER 5: RESULTS AND ANALYSIS -------------------------------------------------------------------------------------------------------

DISCRIMINANT ANALYSIS AND THE IDENTIFICATION OF HIGH VALUE STOCKS

DISCRIMINANT ANALYSIS AND THE IDENTIFICATION OF HIGH VALUE STOCKS Jo Vu (PhD) Senior Lecturer in Econometrics School of Accounting and Finance Victoria University, Australia Email: jo.vu@vu.edu.au ABSTRACT

UCINET Quick Start Guide

UCINET Quick Start Guide This guide provides a quick introduction to UCINET. It assumes that the software has been installed with the data in the folder C:\Program Files\Analytic Technologies\Ucinet 6\DataFiles

Linear Discriminant Analysis

Fiche TD avec le logiciel : course5 Linear Discriminant Analysis A.B. Dufour Contents 1 Fisher s iris dataset 2 2 The principle 5 2.1 Linking one variable and a factor.................. 5 2.2 Linking a

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

SPSS Basics Tutorial 1: SPSS Windows There are six different windows that can be opened when using SPSS. The following will give a description of each of them. The Data Editor The Data Editor is a spreadsheet

TIPS FOR DOING STATISTICS IN EXCEL

TIPS FOR DOING STATISTICS IN EXCEL Before you begin, make sure that you have the DATA ANALYSIS pack running on your machine. It comes with Excel. Here s how to check if you have it, and what to do if you

SPSS: Getting Started. For Windows

For Windows Updated: August 2012 Table of Contents Section 1: Overview... 3 1.1 Introduction to SPSS Tutorials... 3 1.2 Introduction to SPSS... 3 1.3 Overview of SPSS for Windows... 3 Section 2: Entering

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of

15.062 Data Mining: Algorithms and Applications Matrix Math Review

.6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

Text Analytics Illustrated with a Simple Data Set

CSC 594 Text Mining More on SAS Enterprise Miner Text Analytics Illustrated with a Simple Data Set This demonstration illustrates some text analytic results using a simple data set that is designed to

DEA implementation and clustering analysis using the K-Means algorithm

Data Mining VI 321 DEA implementation and clustering analysis using the K-Means algorithm C. A. A. Lemos, M. P. E. Lins & N. F. F. Ebecken COPPE/Universidade Federal do Rio de Janeiro, Brazil Abstract

Introduction to Matrix Algebra

Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary

To do a factor analysis, we need to select an extraction method and a rotation method. Hit the Extraction button to specify your extraction method.

Factor Analysis in SPSS To conduct a Factor Analysis, start from the Analyze menu. This procedure is intended to reduce the complexity in a set of data, so we choose Data Reduction from the menu. And the

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are

Multivariate Analysis of Variance. The general purpose of multivariate analysis of variance (MANOVA) is to determine

2 - Manova 4.3.05 25 Multivariate Analysis of Variance What Multivariate Analysis of Variance is The general purpose of multivariate analysis of variance (MANOVA) is to determine whether multiple levels