Tutorial on Exploratory Data Analysis
1 Tutorial on Exploratory Data Analysis Julie Josse, François Husson, Sébastien Lê julie.josse at agrocampus-ouest.fr francois.husson at agrocampus-ouest.fr Applied Mathematics Department, Agrocampus Ouest useR! 2008, Dortmund, August 11th 1 / 48
2 Why a tutorial on Exploratory Data Analysis? Our research focuses on multi-table data analysis. We have been teaching exploratory data analysis for a long time (but in French...). We made an R package: the possibility to add supplementary information; a more geometrical point of view that allows drawing graphs; the possibility to propose new methods (taking into account different structures in the data); a user-friendly package oriented to practitioners (a very easy GUI). 2 / 48
3 Outline 1 Principal Component Analysis (PCA) 2 Correspondence Analysis (CA) 3 Multiple Correspondence Analysis (MCA) 4 Some extensions Please ask questions! 3 / 48
4 Multivariate data analysis Principal Component Analysis (PCA): continuous variables. Correspondence Analysis (CA): contingency tables. Multiple Correspondence Analysis (MCA): categorical variables. Dimensionality reduction: describe the dataset with a smaller number of variables. Techniques widely used for applications such as data compression, data reconstruction, preprocessing before clustering, etc. 4 / 48
5 Exploratory data analysis Descriptive methods Data visualization Geometrical approach: importance to graphical outputs Identification of clusters, detection of outliers French school (Benzécri) 5 / 48
6 Principal Component Analysis 6 / 48
7 PCA in R prcomp, princomp from stats; dudi.pca from ade4; PCA from FactoMineR. 7 / 48
8 PCA deals with which kind of data? Notations: Figure: Data table in PCA. x_i. denotes individual i; S = (1/I) X_c' X_c is the covariance matrix (X_c the centred data table); W = X_c X_c' is the inner-product matrix. PCA deals with continuous variables, but categorical variables can also be included in the analysis. 8 / 48
9 Some examples Many examples Sensory analysis: products - descriptors Environmental data: plants - measurements; waters - physico-chemical analyses Economy: countries - economic indicators Microbiology: cheeses - microbiological analyses etc. Today we illustrate PCA with: data decathlon: athletes performances during two athletics meetings data chicken: genomics data 9 / 48
10 Introduction PCA example: decathlon data, the performances of 41 athletes during two decathlon meetings, the Olympic Games and the Decastar (2004). 41 athletes (rows), 13 variables (columns): 10 continuous variables corresponding to the performances (100m, Long.jump, Shot.put, High.jump, 400m, 110m.hurdle, Discus, Pole.vault, Javeline, 1500m); 2 continuous variables corresponding to the rank and the points obtained; 1 categorical variable (Competition) corresponding to the athletics meeting (Decastar or OlympicG). The individuals cloud and the variables cloud help to interpret each other. [Extract of the data table omitted.] 10 / 48
11 Problems - objectives Individuals study: similarity between individuals across all the variables (Euclidean distance); partition of the individuals. Variables study: are there linear relationships between variables? Visualization of the correlation matrix; find some synthetic variables. Link between the two studies: characterization of the groups of individuals by the variables; particular individuals help to better understand the links between variables. 11 / 48
12 Two point clouds: the cloud of the individuals (rows) and the cloud of the variables (columns) of the data table X. [Figure omitted.] 12 / 48
13 Pre-processing: Mean centering, Scaling? Mean centering does not modify the shape of the cloud. Scaling: variables are always scaled when they are not in the same units (e.g. long jump in m or in cm, time for 1500 m in s). PCA is always centered and often scaled. 13 / 48
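The centering and scaling step above can be sketched in a few lines. This is an illustration in Python/NumPy (the tutorial itself uses R/FactoMineR), on hypothetical made-up values for two variables with very different units:

```python
import numpy as np

# Hypothetical toy data: 4 athletes, 2 variables on very different scales
# (long jump in metres, 1500 m time in seconds) -- the values are made up.
X = np.array([[7.1, 260.0],
              [7.4, 270.0],
              [6.9, 255.0],
              [7.6, 285.0]])

Xc = X - X.mean(axis=0)       # centering: moves the barycentre of the cloud to 0
Xs = Xc / X.std(axis=0)       # scaling: gives every variable variance 1

print(np.allclose(Xs.mean(axis=0), 0.0), np.allclose(Xs.std(axis=0), 1.0))
# -> True True
```

Centering leaves the shape of the cloud unchanged (it is a translation), while scaling makes the contribution of each variable independent of its unit.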
14 Individuals cloud Individuals lie in R^K. Similarity between individuals: Euclidean distance. Study the structure, i.e. the shape, of the individuals cloud. 14 / 48
15 Inertia Total inertia (multidimensional variance): the distance between the data and the barycentre g: I_g = (1/I) Σ_{i=1}^{I} ||x_i. − g||² = (1/I) Σ_{i=1}^{I} (x_i. − g)'(x_i. − g) = tr(S) = Σ_s λ_s (= K when the variables are scaled). 15 / 48
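The identity total inertia = tr(S) = Σ λ_s = K (for scaled data) can be checked numerically; a small sketch in Python/NumPy on simulated data (the dimensions 41 × 10 mirror the decathlon performances, the values are random):

```python
import numpy as np

rng = np.random.default_rng(0)
I, K = 41, 10                              # 41 individuals, 10 variables
X = rng.normal(size=(I, K))
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # centred and scaled data

S = Z.T @ Z / I                            # covariance matrix of the scaled data
inertia = np.trace(S)                      # total inertia = tr(S)
eigvals = np.linalg.eigvalsh(S)

# tr(S) = sum of the eigenvalues = K for scaled variables
assert np.isclose(inertia, K)
assert np.isclose(eigvals.sum(), inertia)
```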
16 Fit the individuals cloud Find the subspace that best sums up the data: the closest one by projection. Figure: Camel or dromedary? 16 / 48
17 Fit the individuals cloud Projection of x_i. on the axis u_1: P_{u_1}(x_i.) = u_1 (u_1'u_1)^{-1} u_1' x_i.; coordinate: F_{u_1}(i) = <x_i., u_1>. Maximize the variance of the projected data = minimize the distance between the individuals and their projections. Best representation of the diversity and variability of the individuals; do not distort the distances between individuals. 17 / 48
18 Find the subspace (1) Find the first axis u_1, for which the variance of F_{u_1} = Xu_1 is maximized: u_1 = argmax_{u_1} var(Xu_1) with u_1'u_1 = 1. It leads to: var(F_{u_1}) = (1/I)(Xu_1)'(Xu_1) = u_1' (1/I) X'X u_1, i.e. maximize u_1' S u_1 with u_1'u_1 = 1. u_1 is the first eigenvector of S (associated with the largest eigenvalue λ_1): S u_1 = λ_1 u_1. This eigenvector is known as the first axis (loadings). 18 / 48
19 Find the subspace (2) Projected inertia on the first axis: var(F_{u_1}) = var(Xu_1) = λ_1. Percentage of variance explained by the first axis: λ_1 / Σ_k λ_k. Additional axes are defined in an incremental fashion: each new direction is chosen by maximizing the projected variance among all orthogonal directions. Solution: the K eigenvectors u_1, ..., u_K of the data covariance matrix corresponding to the K largest eigenvalues λ_1, ..., λ_K. 19 / 48
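The eigenvector characterization above can be verified directly: the variance of the data projected on the top eigenvector equals λ_1, and no other unit direction does better. A minimal sketch in Python/NumPy on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 5))
Z -= Z.mean(axis=0)                     # centred data
S = Z.T @ Z / Z.shape[0]                # covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)    # eigenvalues in ascending order
u1, lam1 = eigvecs[:, -1], eigvals[-1]  # first axis = top eigenvector

F1 = Z @ u1                             # coordinates of the individuals on axis 1
assert np.isclose(F1.var(), lam1)       # projected variance equals λ1

# no other unit direction beats λ1 (λ1 maximizes the Rayleigh quotient)
u = rng.normal(size=5)
u /= np.linalg.norm(u)
assert (Z @ u).var() <= lam1 + 1e-12
```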
20 Example: graph of the individuals Individuals factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). [Map of the 41 athletes omitted.] We need the variables to interpret the dimensions of variability. 20 / 48
21 Can we interpret the individuals graph with the variables? Correlation between variable x.k and F_u (the vector of the individuals' coordinates, in R^I): plotting r(F_1, k) against r(F_2, k) for each variable k gives the correlation circle graph. 21 / 48
22 Can we interpret the individuals graph with the variables? Variables factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). Figure: Graph of the variables (100m, Long.jump, Shot.put, High.jump, 400m, 110m.hurdle, Discus, Pole.vault, Javeline, 1500m). 22 / 48
23 Find the subspace (1) Variables lie in R^I. Find the first dimension v_1 (in R^I) which maximizes the variance of the projected data. Projection of x.k on v_1: P_{v_1}(x.k) = v_1 (v_1'v_1)^{-1} v_1' x.k; coordinate: G_{v_1}(k) = <v_1, x.k>. Maximize Σ_{k=1}^{K} G_{v_1}(k)² = Σ_{k=1}^{K} cor(v_1, x.k)² = Σ_{k=1}^{K} cos²(θ_k) with v_1'v_1 = 1. v_1 is the best synthetic variable. 23 / 48
24 Find the subspace (2) Solution: v_1 is the first eigenvector of W = XX', the individuals' inner-product matrix (associated with the largest eigenvalue λ_1): W v_1 = λ_1 v_1. The next dimensions are the other eigenvectors. Dimensionality reduction: the principal components are linear combinations of the variables; a subset of components sums up the data. 24 / 48
25 Fit the variables cloud Variables factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). Figure: Graph of the variables. Same representation as before! What a wonderful result! 25 / 48
26 Projections... Only well projected variables (high cos² between the variable and its projection) can be interpreted! [Figure illustrating well and poorly projected points omitted.] 26 / 48
27 Exercises... With 200 independent variables and 7 individuals, what does your correlation circle look like? 27 / 48
28 Exercises... With 200 independent variables and 7 individuals, what does your correlation circle look like? library(FactoMineR) mat <- matrix(rnorm(7*200,0,1),ncol=200) PCA(mat) 28 / 48
29 Link between the two representations: transition formulae Su = λu, i.e. X'Xu = λu, hence XX'Xu = λXu, i.e. W(Xu) = λ(Xu): W F_u = λ F_u. Since W v = λ v, F_u and v are colinear. Since ||F_u|| = √λ and ||v|| = 1, we have: v = (1/√λ) F_u; G_v = X'v = (1/√λ) X'F_u; u = (1/√λ) G_v; F_u = Xu = (1/√λ) X G_v. Hence the transition formulae: F_s(i) = (1/√λ_s) Σ_{k=1}^{K} x_ik G_s(k) and G_s(k) = (1/√λ_s) Σ_{i=1}^{I} x_ik F_s(i). 29 / 48
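The colinearity of F_u and v, and the transition from the individuals' to the variables' coordinates, can be checked numerically. A sketch in Python/NumPy; note the normalization here uses S = Z'Z/I and W = ZZ'/I, so the exact scaling constants differ slightly from the slide's convention, but the structure of the result is the same:

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(30, 6))
Z -= Z.mean(axis=0)
I = Z.shape[0]

S = Z.T @ Z / I                        # K x K covariance matrix
W = Z @ Z.T / I                        # I x I inner-product matrix

lam, U = np.linalg.eigh(S)
lam1, u1 = lam[-1], U[:, -1]
F1 = Z @ u1                            # individuals' coordinates on axis 1

# F1 is an eigenvector of W for the same eigenvalue: W F1 = λ1 F1,
# so F1 and v1 are colinear
assert np.allclose(W @ F1, lam1 * F1)

v1 = F1 / np.linalg.norm(F1)           # unit-norm eigenvector of W
# transition: the variables' coordinates on axis 1 follow from the
# individuals' coordinates (here G1 = sqrt(λ1) u1)
G1 = Z.T @ v1 / np.sqrt(I)
assert np.allclose(G1, np.sqrt(lam1) * u1)
```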
30 Example on decathlon Figure: Individuals and variables representations, Dim 1 (32.72%) × Dim 2 (17.37%). [Superimposed maps of the athletes and of the variables omitted.] 30 / 48
31 Supplementary information (1) For the continuous variables: projection of these supplementary variables onto the dimensions (e.g. Rank and Points on the correlation circle). For the individuals: projection. Supplementary information does not participate in building the axes. 31 / 48
32 Supplementary information (2) How to deal with (supplementary) categorical variables? [Data table with the Competition column, and the table of the category means, omitted.] The categories are projected at the barycentre of the individuals who take them. 32 / 48
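The barycentre rule above is simply a group mean in the factor space. A small sketch in Python/NumPy with hypothetical axis-1 coordinates (the values and the 3-vs-3 split are made up; the real decathlon data has 41 athletes):

```python
import numpy as np

# Hypothetical axis-1 coordinates for 6 athletes, and their meeting
coords = np.array([-1.2, -0.8, -1.5, 0.9, 1.3, 1.1])
meeting = np.array(["Decastar", "Decastar", "Decastar",
                    "OlympicG", "OlympicG", "OlympicG"])

# each supplementary category is plotted at the barycentre (mean point)
# of the individuals carrying that category
barycentres = {m: coords[meeting == m].mean() for m in np.unique(meeting)}
print(barycentres)   # Decastar near -1.17, OlympicG at 1.1
```

The same computation is repeated on every axis to place the category point on the factor map.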
33 Supplementary information (3) Individuals factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%), with the categories Decastar and OlympicG projected. Figure: Projection of the supplementary categorical variable. 33 / 48
34 Confidence ellipses Individuals factor map (PCA), Dimension 1 (32.72%) × Dimension 2 (17.37%). Figure: Confidence ellipses around the barycentre of each category (Decastar, OlympicG). 34 / 48
35 Number of dimensions? Percentage of variance explained by each axis: information brought by the dimension. Quality of the approximation with the first S dimensions: Q_S = Σ_{s=1}^{S} λ_s / Σ_{s=1}^{K} λ_s. Dimensionality reduction implies information loss. Number of components to retain? Retain much of the variability in the data (the other components are noise). Bar plot of the eigenvalues: scree test. Tests on eigenvalues, confidence intervals, ... 35 / 48
36 Percentage of variance obtained under independence Is there a structure in my data? library(FactoMineR) nr <- 50 nc <- 8 iner <- rep(0,1000) for (i in 1:1000) { mat <- matrix(rnorm(nr*nc,0,1),ncol=nc) iner[i] <- PCA(mat,graph=FALSE)$eig[2,3] } quantile(iner,0.95) 36 / 48
37 Percentage of variance obtained under independence Is there a structure in my data? Table: 95% quantile of the inertia on the first two dimensions of a PCA on data with independent variables, as a function of the number of individuals and the number of variables. [Table values omitted.] 37 / 48
39 Quality of the representation: cos² For the variables: only well projected variables (high cos² between the variable and its projection) can be interpreted! For the individuals (same idea): the distance between individuals can only be interpreted for well projected individuals. res.pca$ind$cos2 gives, for each individual (e.g. Sebrle, Clay, Karpov), its cos² on Dim.1, Dim.2, ... [values omitted]. 39 / 48
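The cos² of an individual on an axis is its squared coordinate divided by its squared distance to the centre of the cloud, so across all the axes the cos² of each individual sum to 1. A sketch in Python/NumPy (random data, principal axes obtained via SVD):

```python
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(20, 4))
Z -= Z.mean(axis=0)

_, _, Vt = np.linalg.svd(Z, full_matrices=False)
F = Z @ Vt.T                        # individuals' coordinates on all 4 axes

d2 = (Z ** 2).sum(axis=1)           # squared distance of each individual to the centre
cos2 = F ** 2 / d2[:, None]         # cos² of individual i on axis s

# over all the axes, the cos² of an individual sum to 1
assert np.allclose(cos2.sum(axis=1), 1.0)
```

An individual is well represented on a plane when its cos² on the two axes of that plane add up to a value close to 1.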
40 Contribution Contribution to the inertia of the axis. For the individuals: Ctr_s(i) = F_s²(i) / Σ_{i=1}^{I} F_s²(i) = F_s²(i) / (I λ_s). Individuals with a large coordinate contribute the most to the construction of the axis; round(res.pca$ind$contrib,2) gives these contributions (in %) for each individual (e.g. Sebrle, Clay, Karpov) on each dimension [values omitted]. For the variables: Ctr_s(k) = G_s²(k) / λ_s = cor(x.k, v_s)² / λ_s. Variables highly correlated with the principal component contribute the most to the construction of the dimension. 40 / 48
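Since the coordinates on axis s have mean 0 and variance λ_s, the denominator Σ_i F_s²(i) equals I λ_s, and the contributions on each axis sum to 1 (i.e. 100%). A sketch in Python/NumPy verifying this:

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(15, 3))
Z -= Z.mean(axis=0)
I = Z.shape[0]

S = Z.T @ Z / I
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]      # decreasing eigenvalue order
F = Z @ U                           # coordinates of the individuals on all axes

ctr = F ** 2 / (I * lam)            # Ctr_s(i) = F_s(i)^2 / (I * lambda_s)
# on each axis, the contributions of the individuals sum to 1 (100 %)
assert np.allclose(ctr.sum(axis=0), 1.0)
```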
41 Description of the dimensions (1) By the quantitative variables: the correlation between each variable and the coordinates of the individuals on axis s (the principal component) is calculated; the correlation coefficients are sorted and the significant ones are given. For instance, dimdesc returns in $Dim.1$quanti the variables most correlated with dimension 1 (e.g. Points 0.96, Long.jump 0.74, Shot.put 0.62, with negative correlations for Rank, 400m, 110m.hurdle and 100m), and in $Dim.2$quanti those most correlated with dimension 2 (e.g. Discus 0.61, Shot.put 0.60). 41 / 48
42 Description of the dimensions (2) By the categorical variables: perform a one-way analysis of variance with the coordinates of the individuals on the axis explained by the categorical variable; an F-test per variable; for each category, a t-test to compare the mean of the category with the general mean. For instance, $Dim.1$quali gives the p-value of the F-test for Competition, and $Dim.1$category the estimate and p-value for the categories OlympicG and Decastar [values omitted]. 42 / 48
43 Practice library(FactoMineR) data(decathlon) res <- PCA(decathlon,quanti.sup=11:12,quali.sup=13) plot(res,habillage=13) res$eig x11() barplot(res$eig[,1],main="Eigenvalues",names.arg=1:nrow(res$eig)) res$ind$coord res$ind$cos2 res$ind$contrib dimdesc(res) aa <- cbind.data.frame(decathlon[,13],res$ind$coord) bb <- coord.ellipse(aa,bary=TRUE) plot.PCA(res,habillage=13,ellipse=bb) #write.infile(res,file="my_FactoMineR_results.csv") #to export a list 43 / 48
44 Application Chicken data: 43 chickens (individuals), 7407 genes (variables), one categorical variable: 6 diets corresponding to different stresses. Are genes differentially expressed from one stress to another? Dimensionality reduction: with a few principal components, we identify the structure in the data. 44 / 48
45 Individuals factor map (PCA), Dimension 1 (19.63%) × Dimension 2 (9.35%), chickens coloured by diet (N, J16, J16R16, J16R5, J48, J48R24). [Map omitted.] 45 / 48
46 Individuals factor map (PCA), Dimension 3 (7.24%) × Dimension 4 (5.87%), chickens coloured by diet (N, J16, J16R16, J16R5, J48, J48R24). [Map omitted.] 46 / 48
47 Individuals factor map (PCA), Dimension 1 (19.63%) × Dimension 2 (9.35%) (same map as slide 45). 47 / 48
48 Individuals factor map (PCA), Dimension 3 (7.24%) × Dimension 4 (5.87%) (same map as slide 46). 48 / 48
Reference: Lê, S., Josse, J., Husson, F. (2008). FactoMineR: An R Package for Multivariate Analysis. Journal of Statistical Software, 25(1). http://www.jstatsoft.org/
More informationAn introduction to using Microsoft Excel for quantitative data analysis
Contents An introduction to using Microsoft Excel for quantitative data analysis 1 Introduction... 1 2 Why use Excel?... 2 3 Quantitative data analysis tools in Excel... 3 4 Entering your data... 6 5 Preparing
More informationLecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
More informationData Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary
More informationMultivariate Statistical Inference and Applications
Multivariate Statistical Inference and Applications ALVIN C. RENCHER Department of Statistics Brigham Young University A Wiley-Interscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim
More informationIntroduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
More informationEXPLORING SPATIAL PATTERNS IN YOUR DATA
EXPLORING SPATIAL PATTERNS IN YOUR DATA OBJECTIVES Learn how to examine your data using the Geostatistical Analysis tools in ArcMap. Learn how to use descriptive statistics in ArcMap and Geoda to analyze
More informationLinear algebra and the geometry of quadratic equations. Similarity transformations and orthogonal matrices
MATH 30 Differential Equations Spring 006 Linear algebra and the geometry of quadratic equations Similarity transformations and orthogonal matrices First, some things to recall from linear algebra Two
More informationA New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set
A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set D.Napoleon Assistant Professor Department of Computer Science Bharathiar University Coimbatore
More informationIntroduction to Multivariate Analysis
Introduction to Multivariate Analysis Lecture 1 August 24, 2005 Multivariate Analysis Lecture #1-8/24/2005 Slide 1 of 30 Today s Lecture Today s Lecture Syllabus and course overview Chapter 1 (a brief
More informationMehtap Ergüven Abstract of Ph.D. Dissertation for the degree of PhD of Engineering in Informatics
INTERNATIONAL BLACK SEA UNIVERSITY COMPUTER TECHNOLOGIES AND ENGINEERING FACULTY ELABORATION OF AN ALGORITHM OF DETECTING TESTS DIMENSIONALITY Mehtap Ergüven Abstract of Ph.D. Dissertation for the degree
More informationSTATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
More informationSimple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
More informationLINEAR ALGEBRA W W L CHEN
LINEAR ALGEBRA W W L CHEN c W W L Chen, 1997, 2008 This chapter is available free to all individuals, on understanding that it is not to be used for financial gain, and may be downloaded and/or photocopied,
More informationbusiness statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
More informationBowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition
Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Online Learning Centre Technology Step-by-Step - Excel Microsoft Excel is a spreadsheet software application
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationIntroduction to Matrix Algebra
Psychology 7291: Multivariate Statistics (Carey) 8/27/98 Matrix Algebra - 1 Introduction to Matrix Algebra Definitions: A matrix is a collection of numbers ordered by rows and columns. It is customary
More informationJava Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
More informationStatistics in Psychosocial Research Lecture 8 Factor Analysis I. Lecturer: Elizabeth Garrett-Mayer
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationα = u v. In other words, Orthogonal Projection
Orthogonal Projection Given any nonzero vector v, it is possible to decompose an arbitrary vector u into a component that points in the direction of v and one that points in a direction orthogonal to v
More informationFactor Analysis. Sample StatFolio: factor analysis.sgp
STATGRAPHICS Rev. 1/10/005 Factor Analysis Summary The Factor Analysis procedure is designed to extract m common factors from a set of p quantitative variables X. In many situations, a small number of
More informationSimple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
More informationDidacticiel - Études de cas
1 Topic Multiple Factor Analysis for Mixed Data with Tanagra and R (FactoMineR package). Usually, as a factor analysis approach, we use the principal component analysis (PCA) when the active variables
More informationData analysis process
Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis
More informationx + y + z = 1 2x + 3y + 4z = 0 5x + 6y + 7z = 3
Math 24 FINAL EXAM (2/9/9 - SOLUTIONS ( Find the general solution to the system of equations 2 4 5 6 7 ( r 2 2r r 2 r 5r r x + y + z 2x + y + 4z 5x + 6y + 7z 2 2 2 2 So x z + y 2z 2 and z is free. ( r
More informationData exploration with Microsoft Excel: analysing more than one variable
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
More informationFactor Rotations in Factor Analyses.
Factor Rotations in Factor Analyses. Hervé Abdi 1 The University of Texas at Dallas Introduction The different methods of factor analysis first extract a set a factors from a data set. These factors are
More informationPrinciple Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression
Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate
More informationSection 6.1 - Inner Products and Norms
Section 6.1 - Inner Products and Norms Definition. Let V be a vector space over F {R, C}. An inner product on V is a function that assigns, to every ordered pair of vectors x and y in V, a scalar in F,
More informationMultivariate Analysis
Table Of Contents Multivariate Analysis... 1 Overview... 1 Principal Components... 2 Factor Analysis... 5 Cluster Observations... 12 Cluster Variables... 17 Cluster K-Means... 20 Discriminant Analysis...
More informationSoft Clustering with Projections: PCA, ICA, and Laplacian
1 Soft Clustering with Projections: PCA, ICA, and Laplacian David Gleich and Leonid Zhukov Abstract In this paper we present a comparison of three projection methods that use the eigenvectors of a matrix
More informationLecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization
Lecture 2. Marginal Functions, Average Functions, Elasticity, the Marginal Principle, and Constrained Optimization 2.1. Introduction Suppose that an economic relationship can be described by a real-valued
More informationCHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES. From Exploratory Factor Analysis Ledyard R Tucker and Robert C.
CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES From Exploratory Factor Analysis Ledyard R Tucker and Robert C MacCallum 1997 180 CHAPTER 8 FACTOR EXTRACTION BY MATRIX FACTORING TECHNIQUES In
More information