Part II: Multivariate data
Jiří Militky: Computer assisted statistical modeling in the textile research
Experimental data peculiarities
- Small samples (rule of thumb: 100 points per feature)
- Non-normal distributions
- Presence of heterogeneities and outliers
- Data-oriented model creation
- Physical sense of data
- Uncertainty in model selection
(Figure: concentration C = f(t) approaching the equilibrium C_eq over time.)
Style of analysis
- Data exploration
- Simplification of data structures
- Interactive model selection
- Interpretation of results
Depth contours: a multidimensional analog of the median.
Primary data
All data are compiled into one matrix X (n x m):
- each column of X represents one feature (variable)
- each row of X represents one object (i.e. one observation: one point in time, one person, one piece, etc.)
(Figure: DATA MATRIX (n x m) — objects as rows, features as columns.)
Data transformation I
Linear: centering, scaling, standardization x̃ = (x - µ)/σ.
Nonlinear: logarithmic transformation
1. Reduces the contribution of extremes.
2. Reduces right skewness of data.
3. Stabilizes variance (removes heteroskedasticity).
Rank: values are replaced by their ranks.
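The standardization and log transformation above can be sketched in a few lines (a minimal NumPy illustration; the sample values are invented):

```python
import numpy as np

# Invented sample with a strong right skew (illustration only)
x = np.array([1.0, 2.0, 4.0, 8.0, 64.0])

z = (x - x.mean()) / x.std()   # standardization: zero mean, unit variance
y = np.log(x)                  # log transform: damps the extreme value 64
```

After standardization the sample has mean 0 and standard deviation 1; the log transform compresses the range far more than any linear rescaling can.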
Linear transformation I
Common PCA is based on column-centered data (covariance matrix). Standardization leads to the correlation matrix R. The differences are caused by different weighting: for centered data the columns of X are "weighted" according to their length |x_i| (the standard deviation in the original data); for standardized data the columns of X are "weighted" to unit length, so the weights are all the same. For features in various units it is suitable to use the correlation matrix.
Linear transformation II
Centering removes the absolute (intercept) term and thus reduces the number of variables. The configuration of the data is not changed; only the origin is shifted. Standardization removes the dependence on units and removes heteroskedasticity. It has influence on parameter estimates (weighted least squares). It is inappropriate for cases where some features are at the noise level (it inflates their importance).
Linear transformation III Centering. Standardization
Outline
- Basic problems due to multivariate data
- Projections of correlated data
- Dimension reduction techniques
Dimensionality problem I
(Figure: iris data — setosa, versicolor, virginica.)
The basic characteristic of multivariate data is its dimension (number of elements). High dimensions bring huge problems into their statistical analysis.
Variable reduction: variables often have variability at the noise level and can therefore be excluded from the data (they bring no information). There are also redundancies due to near-linear dependencies between some variables, or due to linkages arising from their physical essence. In both cases it is possible to replace the original set by a reduced number of uncorrelated new variables.
Dimensionality problem II
Multivariate curse: the number of data points necessary to achieve a given precision of multivariate estimates is an exponential function of the number of variables.
Empty space phenomenon: multivariate data are concentrated in the peripheral part of the variable space.
Distance problem: the distance between objects is often weighted by the strength of the mutual links between variables.
Multivariate exploratory analysis I
For n objects (points) m variables (features) are defined, expressed in cardinal scale. The input data matrix X has dimension n x m. It is standard that n is higher than m. The value m defines the problem dimension (number of features).

X = [ x_11 ... x_1m ; ... ; x_n1 ... x_nm ]

Aims:
(a) Assess object similarity or clustering tendency.
(b) Retrieve outliers, or features of outliers.
(c) Determine linear relations between features.
(d) Prove assumptions about data (normality, no correlation, homogeneity).
Multivariate exploratory analysis II
- Graphical display: 2D or 3D representation of data
- Identification of objects or features appearing to be outlying
- Indication of structures in data, such as heterogeneities, multiple groups, etc.
Multivariate exploratory analysis III
Most methods for multivariate data exploration can be divided into the following categories: generalized scatter graphs, symbols, projections.
Profiles
Each object x_i is characterized by m piecewise lines whose size is proportional to the corresponding value of the feature x_ij, j = 1, ..., m. On the x-axis is the index of the features. The profile is created by joining the end points of the individual lines. It is suitable to scale the y-axis (values of features). Profiles are simple and easily interpretable; it is possible to identify outliers and groups of objects with similar behavior.
Chernoff faces
Humans have specialized brain areas for face recognition. For d < 20 features, face elements can be used.
Latent variables I
Scatter graphs in modified coordinates enable simpler interpretation and reduce distortion or artifacts. As suitable coordinates, latent variables are used. Typical latent variables are based on principal component analysis (PCA). This method is useful for cases where the columns of the X matrix are highly correlated.
(Figure: first principal component — direction of maximum variability; second principal component.)
Latent variables II PCA combined with dynamic graphics (rotation of coordinates).
Multivariate geometry
Data lie in a hypercube. Consider the diagonal vector v starting from the center and ending in one corner. The angle between this vector and a selected axis e_i is given by the relation

cos θ_i = (v^T e_i) / (|v| |e_i|) = ± 1/√m

For higher m this cosine approaches zero, so diagonal vectors are nearly perpendicular to all axes. In scatter graphs, point clusters oriented in the diagonal directions are then projected onto the origin and are not identifiable.
Volume concentration
The volume concentration phenomenon is not visible from classical 2D or 3D geometry. The volume of a hypersphere with radius r in m-dimensional space is

V_k = π^(m/2) r^m / Γ(m/2 + 1)

The volume of a hypercube with edge length 2r is

V_h = 2^m r^m

For the hypersphere inscribed in the hypercube, the ratio of volumes is

V_k / V_h = π^(m/2) / (2^m Γ(m/2 + 1)) → 0  for  m → ∞

The volume of the hypercube is then concentrated in its corners and the central part is nearly empty.
Volumes ratio
Influence of the space dimension on the volume ratio of a hypercube and the inscribed hypersphere: from dimension m = 8 the volume of the sphere is negligible in comparison with the cube volume. Data in multidimensional space are concentrated in the periphery, and to cover the central part it would be necessary to have a huge number of points (objects).
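The ratio V_k/V_h = π^(m/2)/(2^m Γ(m/2 + 1)) from the previous slide can be checked numerically (stdlib-only sketch):

```python
import math

def volume_ratio(m):
    # hypersphere of radius r inscribed in a hypercube of edge 2r
    return math.pi ** (m / 2) / (2 ** m * math.gamma(m / 2 + 1))
```

For m = 2 the ratio is π/4 ≈ 0.785; it decreases monotonically and by m = 8 it is already below 2 %.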
Multivariate normal distribution I
Let us have a multivariate normal distribution with mean equal to zero and unit covariance matrix. The dependence of the multivariate normal distribution function value at the point x = (1, 1, ..., 1) on the dimension m is shown in the figure.
Multivariate normal distribution II
In the central area the probability of occurrence of the random variable is very low in comparison with the tail area. The dependence of the standardized multivariate normal distribution function at the point x = (2, 2, ..., 2) is shown in the figure. The decrease of probability for higher m is clearly visible.
Multivariate normal distribution III
It is well known that the sum of squares of independent standardized normally distributed elements of a random vector x has a chi-squared distribution:

||x||² = Σ_{i=1}^{m} x_i²

Because the mean value is equal to zero, this norm is equal to the distance from the origin. The probability of occurrence of a multivariate normal random vector in a sphere centered at the origin with radius r is then equal to

P(||x|| ≤ r) = P(χ²_m ≤ r²)
Multivariate normal distribution IV
The dependence of the probability of occurrence of a multivariate normal random vector in a sphere centered at the origin with radius r = 3 on the dimension m is shown in the figure. The quick decrease towards zero is visible. Starting from m = 8, the occurrence of individual objects in the central area has small likelihood. For higher dimensions the majority of points will then be in the tail area; the tails play the more important role here (paradox of dimensionality).
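The probability P(||x|| ≤ r) = P(χ²_m ≤ r²) can be evaluated without external libraries via the series expansion of the regularized lower incomplete gamma function (a stdlib sketch; accuracy is adequate for illustration):

```python
import math

def chi2_cdf(x, m):
    # P(chi2_m <= x) = regularized lower incomplete gamma P(m/2, x/2)
    s, z = m / 2.0, x / 2.0
    if z <= 0:
        return 0.0
    term = 1.0 / s
    total = term
    k = 1
    while term > 1e-16 * total:
        term *= z / (s + k)
        total += term
        k += 1
    return total * math.exp(-z + s * math.log(z) - math.lgamma(s))

# probability that a standard m-variate normal vector lies in a sphere of radius 3
p_in_sphere = [chi2_cdf(9.0, m) for m in range(1, 21)]
```

The sequence `p_in_sphere` decreases quickly with m, which is exactly the effect described on the slide.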
Data projection I
Usually the first two PCs are used for a 2D projection; the information from the last two PCs can be interesting as well. These projections preserve angles and distances between objects (points). On the other hand, there is no objective criterion for revealing hidden structures in the data.
Linear projections of multivariate data (projection pursuit) satisfy some criterion called the projection index IP(C_i). The projection vectors C_i maximizing IP(C_i) under the constraints C_i^T C_i = 1 are computed. The projection onto these vectors is then C_i^T X.
Linear transformation
2D vectors X in a unit circle with mean (1, 1); Y = A X, where A is a 2 x 2 matrix:

[Y_1; Y_2] = [a_11 a_12; a_21 a_22] [X_1; X_2]

The shape and the mean are changed: scaling (a_ii elements), rotation, mirror reflection. Distances between vectors are not invariant:

||Y_1 - Y_2|| ≠ ||X_1 - X_2||
Data projection II
It is interesting that the index IP corresponding to principal components is

IP(C) = max (C_i^T S C_i)  for  C_i^T C_i = 1

where S is the sample covariance matrix (it can be simply robustified). The C_i satisfying the maximum condition is the eigenvector of the matrix S having the i-th largest eigenvalue λ_i, i = 1, 2. The vectors C_1 and C_2 are orthogonal. The index IP(C) corresponds to the minimum, over all projections C, of the maximum of the log-likelihood function for normally distributed data N(C^T µ, C^T S C). For normally distributed samples the projection onto the first two principal components is then optimal.
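A minimal NumPy sketch (invented data; the mixing matrix is arbitrary) of the statement above: the eigenvectors of S, ordered by decreasing eigenvalue, maximize C^T S C under C^T C = 1 and are mutually orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)
# correlated 3D sample
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])
S = np.cov(X - X.mean(axis=0), rowvar=False)

lam, V = np.linalg.eigh(S)          # eigh returns ascending order
lam, V = lam[::-1], V[:, ::-1]      # reorder: lambda_1 >= lambda_2 >= ...
C1, C2 = V[:, 0], V[:, 1]           # first two projection vectors
```

C1 attains the maximal value C^T S C = λ_1 among all unit vectors, and C2 does the same in the subspace orthogonal to C1.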
PCA limitations
PCA leads to the vertical axis (no discrimination); PP leads to the horizontal axis (two clusters).
Data projection III
Frequently the selection of clusters in the projection is the main goal. For these purposes the index is equal to the ratio between the mean inter-object distance D and the mean distance between nearest neighbors d. Some indexes are based on the pdf of the data in the projection, f_P(x); the estimator of f_P(x) is usually a kernel pdf estimator:

IP(C) = ∫ f_P(x)² dx

Differences from normality, expressed by the normal pdf φ(x), are included in the index

IP(C) = ∫ φ(x) [f_P(x) - φ(x)] dx
Nonlinear projection
The Sammon algorithm projects from the original space to a reduced space having nearly the same distances between objects. Let d_ij* be the distances between two objects in the original space and d_ij the corresponding distances in the reduced space. The target function E (to be minimized) has the form

E = (1 / Σ_{i<j} d_ij*) Σ_{i<j} (d_ij* - d_ij)² / d_ij*

The iterative Newton method or heuristically oriented algorithms are used.
Comparison of projections
PCA goals
- Selection of combinations of original variables (latent variables) explaining the main part of the overall variance
- Discovering data structures characterizing links between features of objects
- Dimension reduction
- Removal of noise and outliers
- Creation of an optimal summary of features (first PC)
(Figure: scatter plot of x_2 vs. x_1.)
PC features
PC = principal component (latent variable)
- Linear combination of original variables
- Mutual orthogonality
- Arrangement by importance
- Rotation of the coordinate system
- Optimal low-dimensional projection
- Explanation of the maximal portion of data variance
(Figure: rotation of the axes x_1, x_2 into z_1, z_2.)
PCA utilization
- Dimensionality reduction
- Multivariate normality testing
- Data exploration
- Indication of multivariate data structure
- Data projection
- Special regression models
First PC
PC1 = y_1 explains the maximum of the original data variability.
(Figure: direction through the overall mean of the dataset.)
Second PC
PC2 = y_2 explains the maximum of the variability not included in y_1; y_2 is orthogonal to y_1.
(Figure: direction through the overall mean of the dataset.)
Mathematical formalization
x_C ... centered original data expressed as deviations from the means:

x_C = (x_1 - µ_1, x_2 - µ_2, ..., x_m - µ_m)^T

First PC: y_1 = V_1^T x_C
Second PC: y_2 = V_2^T x_C

D(y_1) = D(V_1^T x_C) = E[(V_1^T x_C)(V_1^T x_C)^T] = V_1^T E(x_C x_C^T) V_1 = V_1^T C V_1

C ... covariance matrix; V_1 ... vector of loadings for PC1; V_2 ... vector of loadings for PC2; y_j ... j-th PC; V ... factor loadings.
Properties of loadings I
Normalization conditions and mutual orthogonality:

V_1^T V_1 = 1,  V_2^T V_2 = 1  and  V_1^T V_2 = 0

Computation of the loadings V_1, V_2, ..., V_m leads to maximization of the variances under equality constraints. The solution shows that V_j is the eigenvector of the covariance matrix C corresponding to the j-th largest eigenvalue λ_j.
Covariance matrix decomposition:

C = V Λ V^T

V is an (m x m) matrix whose columns are the loading vectors V_j; Λ is an (m x m) diagonal matrix whose diagonal elements are the covariance matrix eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_m.
Properties of loadings II
The matrix V is orthogonal, i.e. V^T V = E. The variance D(y_j) = λ_j is equal to the j-th eigenvalue. The overall data variance is equal to the sum of the PC variances:

tr C = Σ_{j=1}^{m} λ_j

The relative contribution of the j-th PC to the explanation of data variability is

P_j = λ_j / Σ_{i=1}^{m} λ_i
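These identities (tr C = Σλ_j; the P_j sum to one) are easy to verify numerically (NumPy sketch with random invented data):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
C = np.cov(X - X.mean(axis=0), rowvar=False)

lam = np.sort(np.linalg.eigvalsh(C))[::-1]   # lambda_1 >= ... >= lambda_m
P = lam / lam.sum()                          # relative contributions P_j
```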
Properties of loadings III
Covariances between the j-th PC and the vector of features x_C are

cov(x_C, y_j) = cov(x_C, V_j^T x_C) = E(x_C x_C^T) V_j = C V_j = λ_j V_j

The covariance between the i-th feature x_Ci and the j-th PC y_j is

cov(x_Ci, y_j) = λ_j V_ij

where V_ij is the i-th element of the vector V_j. The correlation coefficient r(x_Ci, y_j) has the form

r(x_Ci, y_j) = λ_j V_ij / (σ_xi √λ_j) = √λ_j V_ij / σ_xi
Properties of loadings IV
Replacement of the centered features x_C by the normalized features x_N:

x_N = ((x_1 - µ_1)/σ_x1, (x_2 - µ_2)/σ_x2, ..., (x_m - µ_m)/σ_xm)^T

The correlation coefficient then reduces to the form

cov(x_Ni, y_j) = r(x_Ni, y_j) = V_ij* √λ_j*

where V_j* and λ_j* are the eigenvectors and eigenvalues of the correlation matrix R.
Two features example I
Two features x_1 and x_2, with covariance matrix C and correlation matrix R:

C = [σ_1² C_12; C_12 σ_2²],  R = [1 r; r 1]

PCA for the correlation matrix. Condition for the eigenvalue computation:

det(R - l E) = det [1-l r; r 1-l] = 0

After rearrangement (1 - l)² - r² = 0, i.e.

l² - 2 l + 1 - r² = 0
Two features example II
Solution of the quadratic equation l² - 2l + 1 - r² = 0:

l_1 = 0.5 [2 + √(4 - 4(1 - r²))] = 1 + r = λ_1
l_2 = 0.5 [2 - √(4 - 4(1 - r²))] = 1 - r = λ_2

The eigenvectors are the solution of the homogeneous equations (R - λ_i E) V_i = 0, i = 1, 2.
Two features example III
The normalized eigenvector V_1* has the form

V_1* = (1/√2) (1, 1)^T

The normalized eigenvector V_2* has the form

V_2* = (1/√2) (1, -1)^T
Two features example IV
First PC: y_1 = (z_1 + z_2)/√2
Second PC: y_2 = (z_2 - z_1)/√2

where z_1 = (x_1 - E(x_1))/√D(x_1) and z_2 = (x_2 - E(x_2))/√D(x_2).

Application of normalized features makes the PCs independent of the correlation in the original data. The coordinate system is rotated by 45°, i.e. cos α = 1/√2.
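The closed-form two-feature solution (λ = 1 ± r, loadings (1, ±1)/√2) can be confirmed numerically for any chosen r (NumPy sketch; r = 0.6 is an arbitrary illustration):

```python
import numpy as np

r = 0.6
R = np.array([[1.0, r],
              [r, 1.0]])
lam, V = np.linalg.eigh(R)   # ascending order: lam = (1 - r, 1 + r)
```

The eigenvalues are 1 - r and 1 + r, and the loadings are (1, ±1)/√2 regardless of r, which is exactly the 45° rotation on the slide.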
Scores plot
Scores T = X_C V ... values of the PCs for all objects.
Reconstructed data: X_C = T V^T, (n x m) = (n x m)(m x m).
Reduction of the PC number:
- Selection of a few PCs (p)
- Replacing the loading matrix V (m x m) by the reduced loading matrix V_f (m x p)
- Computation of the reduced scores T_f = X_C V_f
(Figure: scores plot of PC2 vs. PC1, acute vs. chronic groups.)
Reduced PC I
The percentage of variability explained by the j-th principal component is

P_j = (λ_j / Σ_{i=1}^{m} λ_i) × 100

In practice it is often the case that, although hundreds of variables are measured, the first few PCs explain almost all of the variability (information) in the data.
Scree plot: bar diagram of the ordered eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_m. Often a gap between the important and unimportant PCs is visible.
Reduced PC II
(Figure: scree plot — eigenvalue vs. component number.)
Due to the use of reduced loadings there are differences between the reconstructed data matrices. The centered matrix X_C is decomposed into the matrix of component scores T (n x k) and the loading matrix V_k^T (k x m), with an information loss, i.e. the error matrix O (n x m):

X_C = T V_k^T + O
Bilinear regression model
The model X_C = T V_k^T + O has as parameters the scores T and the loadings V_k:

S(µ, V_k, y) = Σ_{i=1}^{n} (x_i - µ - V_k y_i)^T (x_i - µ - V_k y_i)

The minimization of the length of O, or of the distance dist(X_C - T V_k^T), can be realized; the results are the same as for maximization of variance. Reconstructed matrix and residual matrix:

X_Cr = T_f V_k^T = X_C V_k V_k^T
O_r = X_C - X_Cr = X_C (E - V_k V_k^T)
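A NumPy sketch (invented data) of the decomposition X_C = T V_k^T + O and the residual identity O = X_C (E - V_k V_k^T):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) @ np.array([[2.0, 1.0, 0.1],
                                         [0.5, 1.5, 0.1],
                                         [0.0, 0.2, 0.1]])
Xc = X - X.mean(axis=0)

lam, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
Vk = V[:, ::-1][:, :2]   # reduced loading matrix (m x k), k = 2

T = Xc @ Vk              # reduced scores
Xcr = T @ Vk.T           # reconstructed data
O = Xc - Xcr             # error (residual) matrix
```

Keeping the top two of three PCs, the residual norm is strictly smaller than the data norm, and O equals the projection of X_c onto the discarded loading direction.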
PC interpretation I
Original matrix X_C = (x_1, ..., x_m) ... n points in the m-dimensional feature space.
Score matrix (t_1, ..., t_m) ... n points in the m- or k-dimensional space of the PCs.
The j-th vector t_j of the PCs and the i-th vector x_Ci satisfy

t_j = Σ_{i=1}^{m} V_ij x_Ci,   x_Ci = Σ_{j=1}^{m} V_ij t_j

V_ij are elements of the matrix V_m (resp. V_k if only k components are used).
PC interpretation II
In the feature space, t_j is the weighted sum of the vectors x_Ci with weights V_ij. The length of this vector is

d(t_j) = √(t_j^T t_j) = √λ_j

The projection t_Pi of the vector x_Ci onto t_j is expressed as t_Pi = b t_j, where b is the slope:

t_j^T (x_Ci - b t_j) = 0  and  b = t_j^T x_Ci / (t_j^T t_j) = t_j^T x_Ci / λ_j
PC interpretation III
Because t_j^T t_k = 0 for j ≠ k (the vectors t_j are orthogonal), it is valid that

t_j^T x_Ci = t_j^T Σ_{k=1}^{m} V_ik t_k = V_ij λ_j

Therefore b = V_ij, and the projection vector t_Pi = V_ij t_j has length

p_ij = √(t_Pi^T t_Pi) = V_ij √λ_j

The length of the vector t_j is then

d(t_j) = √(Σ_{i=1}^{m} p_ij²) = √(λ_j Σ_{i=1}^{m} V_ij²) = √λ_j
PC interpretation IV
The contribution of each original variable to the length of the vector t_j is proportional to the square of V_ij. The length of this vector is proportional to the standard deviation of the corresponding PC. The variance explained by the j-th PC is composed of the contributions of the original features, and their importance is expressed by V_ij². A small V_ij means that the i-th original feature has a small contribution to the variability of the j-th PC and is not important. If the i-th row of the matrix V has all elements small, the i-th feature is not important for any PC.
Contribution plot
A graph composed of m groups; in each group there are m columns. Each group corresponds to one PC and each column represents one feature. The heights of the columns are related to V_ij² λ_j. The heights of the columns in the first group are standardized so that their sum is 100 % (division by the sum of their lengths L_s). The same standardization (division by L_s) is used for the rest of the groups as well. It is then simple to investigate the influence of the features on the PCs.
Correlations
Relations between the original features and the PCs are quantified by the correlation coefficients between x_Ci and t_j:

r_ij = cos α = x_Ci^T t_j / √((x_Ci^T x_Ci)(t_j^T t_j)) = V_ij √λ_j / σ_i = p_ij / σ_i

where σ_i is the standard deviation of the i-th feature. By using normalized variables (replacement of the matrix S by the correlation matrix R), the correlation coefficients are directly equal to the partial projections, r_ij = p_ij. A higher r_ij indicates a higher projection: x_i is close to t_j and contributes markedly to the variance explained by the j-th PC.
PCA for correlated data I
Simulated data arise from a 3D normal distribution with zero mean vector and correlation matrix

R = [1 r_12 r_13; r_12 1 r_23; r_13 r_23 1]

N = 500 data points were generated for various paired correlation coefficients.
PCA for correlated data II All correlations are zero
PCA for correlated data III All correlations are 0.9
False correlations I
r_12 = H, r_13 = H², r_23 = H. For the simulation H = 0.9 was selected.
Multiple correlation coefficient: R_1(2,3) = H.
Partial correlation coefficients: R_13(2) = 0, R_12(3) = H/√(1 + H²).
The variable x_3 is then a parasite and does not contribute to the explanation of the variability of the feature x_1.
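The parasite effect can be checked with the standard first-order partial-correlation formula (stdlib sketch; the formula is standard, the correlation structure r_12 = H, r_13 = H², r_23 = H is the slide's):

```python
import math

H = 0.9
r12, r13, r23 = H, H * H, H

def partial(rxy, rxz, ryz):
    # first-order partial correlation r_xy.z
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

r13_2 = partial(r13, r12, r23)   # x3 vs x1 with x2 removed
r12_3 = partial(r12, r13, r23)   # x2 vs x1 with x3 removed
```

With x_2 held fixed, x_3 carries no information about x_1 (r13_2 = 0), while x_2 remains informative (r12_3 = H/√(1 + H²) ≈ 0.67).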
False correlations II Scree and contribution plots
Low paired correlations I
r_12 = H² = 0.01, r_13 = 0, r_23 = √(1 - 2H⁴).
Multiple correlation coefficient: R_1(2,3) = 0.707.
Partial correlation coefficients: R_13(2) = 0.71, R_12(3) = 0.71.
All variables are therefore important.
Low paired correlations II
Scree and contribution plots. PCA is not able to fully replace correlation analysis.
Distances in feature spaces
Data vectors, m dimensions: X^T = (X_1, ..., X_m), Y^T = (Y_1, ..., Y_m).
A distance (metric) function satisfies d(X, Y) ≥ 0, d(X, Y) = d(Y, X) and d(X, Y) ≤ d(X, Z) + d(Z, Y).
Popular distance functions:
- Minkowski distance (L_r metric): d(X, Y) = (Σ_{k=1}^{m} |X_k - Y_k|^r)^(1/r)
- Manhattan (city-block) distance (L_1 norm): d(X, Y) = Σ_{k=1}^{m} |X_k - Y_k|
- Euclidean distance (L_2 norm): d(X, Y) = (Σ_{k=1}^{m} (X_k - Y_k)²)^(1/2)
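All three metrics can be written as one Minkowski function (stdlib sketch):

```python
def minkowski(x, y, r):
    # L_r metric; r = 1 gives Manhattan, r = 2 gives Euclidean
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def manhattan(x, y):
    return minkowski(x, y, 1)

def euclidean(x, y):
    return minkowski(x, y, 2)
```

For the points (0, 0) and (3, 4) the Euclidean distance is 5 and the Manhattan distance is 7.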
Basic metrics
(Figure: Manhattan (city-block) vs. Euclidean distance in the (X_1, X_2) plane.)
Identical distance between two points — imagine that in 10 D!
Other metrics
Ultrametric: d_ij ≤ max(d_ih, d_hj) replaces the triangle inequality d_ij ≤ d_ih + d_hj for all i, h, j.
Four-point additive condition: d_hi + d_jk ≤ max(d_hj + d_ik, d_hk + d_ij) for all h, i, j, k.
Invariant distances
The Euclidean distance is not invariant to linear transformations Y = A X; scaling of units has a strong influence on distances:

||Y^(1) - Y^(2)||² = (Y^(1) - Y^(2))^T (Y^(1) - Y^(2)) = (X^(1) - X^(2))^T A^T A (X^(1) - X^(2))

The Mahalanobis metric replaces A^T A by the inverse of the covariance matrix. For orthonormal matrices (A^T A = I, rigid rotations) distances are preserved. Invariance requires standardization plus the covariance matrix.
Distances
Mahalanobis distance: d_i² = (x_i - x_A)^T S^{-1} (x_i - x_A)
Euclidean distance: d_i² = (x_i - x_A)^T (x_i - x_A)
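Both distances in a small NumPy sketch; with S = E the Mahalanobis distance reduces to the Euclidean one:

```python
import numpy as np

def mahalanobis_sq(x, x_a, S):
    # squared Mahalanobis distance (x - x_A)^T S^{-1} (x - x_A)
    d = x - x_a
    return float(d @ np.linalg.solve(S, d))

def euclidean_sq(x, x_a):
    d = x - x_a
    return float(d @ d)
```

A covariance matrix with large variances shrinks the corresponding coordinates' contribution, which is exactly the unit-invariance the slide describes.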
Outlying objects I
Indication of outliers is sensitive to the presence of masking, where outliers appear to be correct (due to covariance matrix augmentation), or swamping, where correct values appear to be outliers (due to the presence of outlying points).
Outlying objects II
Objects are identified as outliers when

d_i² > c(p, N, α_N)

For the case of the multivariate normal distribution, c(p, N, α_N) is equal to a quantile of the chi-squared distribution:

c(p, N, α_N) = χ²_p(1 - α/N)
Outlying objects III
For application of the Mahalanobis distance approach it is necessary to know clean estimators x_A and S. A robust estimator of the covariance matrix can be obtained in the following ways:
- M estimators
- S estimators minimizing det C under constraints
- Estimators minimizing the volume of the confidence ellipsoid
EDA analysis requires visualization of outliers without corruption of the projections.
Simple solution
Evaluation of a clean subset of data:
1. Selection of a starting subset based on
   - the Mahalanobis distance and trimming of suspicious data, or
   - the distance from the multivariate median.
   The result is a subset of data with parameters x_AC, S_C.
2. Calculation of residuals

   d_i² = (x_i - x_AC)^T S_C^{-1} (x_i - x_AC)

3. Iterative modification of the clean subset so that it contains the points with residuals lower than c χ²_α, where

   h = (n + p + 1)/2,  c_1 = max(0, (h - r)/(h + r))²,  c_2 = 1 + (p + 1)/(n - p) + 2/(n - 1 - 3p),  c = c_1 + c_2
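A simplified NumPy sketch of the iterative clean-subset idea (not the exact estimator above: here a fixed fraction of points with the smallest Mahalanobis distance is kept and the estimates are re-computed; the data and the `keep` fraction are invented):

```python
import numpy as np

def clean_subset(X, n_iter=5, keep=0.75):
    # iteratively keep the fraction `keep` of points with the smallest
    # Mahalanobis distance and re-estimate the mean and covariance from them
    idx = np.arange(len(X))
    for _ in range(n_iter):
        mu = X[idx].mean(axis=0)
        S = np.cov(X[idx], rowvar=False)
        diff = X - mu
        d2 = np.einsum('ij,ij->i', diff @ np.linalg.inv(S), diff)
        idx = np.argsort(d2)[: int(keep * len(X))]
    mu, S = X[idx].mean(axis=0), np.cov(X[idx], rowvar=False)
    return idx, mu, S

# invented demo: 95 clean points plus 5 gross outliers
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
X[:5] += 20.0
idx, mu, S = clean_subset(X)
```

The gross outliers are excluded from the clean subset, so the final x_AC and S_C describe only the uncontaminated bulk of the data.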
PCA of corrupted normal data
(Figure: scores plot, axis 1 vs. axis 2, with 90 %, 99 % and 99.9 % tolerance ellipses; features: savings, yearly income; outlying objects marked as exceptions.)