UMEÅ UNIVERSITET
Matematisk-statistiska institutionen
Multivariat dataanalys D, MSTD79
PA

EXAM 004-0-9

SUGGESTED SOLUTIONS TO THE EXAM IN MATHEMATICAL STATISTICS
Multivariate Data Analysis D, 5 credits

1. Assume that $x_1, x_2, \ldots, x_n$ are iid observations from a p-dimensional normal distribution with mean $\mu$ and variance matrix $\Sigma$. Let $\bar{x}$ and $S$ denote the sample mean and sample variance matrix respectively.

Derive the distributions of the following statistics:

a) $\bar{x}$

$\bar{x}$ is a linear combination of p-variate normally distributed random variables, and hence itself has a p-variate normal distribution, with

$E(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n} E(x_i) = \frac{1}{n}\, n\mu = \mu$,

$V(\bar{x}) = V\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} V(x_i) = \frac{n\Sigma}{n^2} = \frac{\Sigma}{n}$.

Hence $\bar{x} \sim N_p(\mu, \Sigma/n)$.

b) $Ax_1 + x_2$, where $A$ is a fixed $p \times p$ matrix

$Ax_1 + x_2$ is a linear combination of p-variate normally distributed random variables, and hence itself has a p-variate normal distribution. Since $x_1$ and $x_2$ are independent,

$E(Ax_1 + x_2) = A\mu + \mu = (A + I)\mu$,

$V(Ax_1 + x_2) = A\Sigma A' + \Sigma$.

c) $(x_1 - \mu)'\Sigma^{-1}(x_1 - \mu)$

$x_1 \sim N_p(\mu, \Sigma)$. Suppose $\mathrm{rank}(\Sigma) = p$. Then there exists a matrix $B$ such that $BB' = \Sigma$ and $u = B^{-1}(x_1 - \mu) \sim N_p(0, I)$. We know that the sum of squares of p independent standard normal variables has a chi-square distribution with p degrees of freedom. Hence,

$u'u = (x_1 - \mu)'(BB')^{-1}(x_1 - \mu) = (x_1 - \mu)'\Sigma^{-1}(x_1 - \mu) \sim \chi^2(p)$.

Describe the distributions (without proof) of the following statistics:

d) $(n-1)S$

$(n-1)S \sim W_p(n-1, \Sigma)$, a Wishart distribution with $n-1$ degrees of freedom and scale matrix $\Sigma$.

e) $n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu)$

$n(\bar{x} - \mu)'S^{-1}(\bar{x} - \mu) \sim T^2(p, n-1)$, Hotelling's $T^2$-distribution.
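The results in a), c) and e) can be checked by simulation. The following is a minimal Monte Carlo sketch, not part of the original solution: the function name `simulate_problem1`, the mean vector, the matrix `root` (playing the role of B) and the sample sizes are all hypothetical choices. It compares the empirical covariance of the sample mean with Σ/n, the mean of the quadratic form in c) with p, and the mean of the statistic in e) with (n-1)p/(n-p-2), which follows from the relation T²(p, n-1) = ((n-1)p/(n-p)) F(p, n-p).

```python
import numpy as np

def simulate_problem1(n=20, p=3, reps=20000, seed=0):
    """Monte Carlo check of parts a), c) and e); all parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    mu = np.arange(1.0, p + 1.0)                       # hypothetical mean vector
    root = np.tril(np.full((p, p), 0.5)) + np.eye(p)   # hypothetical B with BB' = Sigma
    sigma = root @ root.T
    x = rng.multivariate_normal(mu, sigma, size=(reps, n))  # shape (reps, n, p)
    xbar = x.mean(axis=1)

    # a) the covariance matrix of the sample mean should be close to Sigma / n
    cov_err = np.abs(np.cov(xbar, rowvar=False) - sigma / n).max()

    # c) quadratic form in one observation; a chi-square(p) variable has mean p
    d1 = x[:, 0, :] - mu
    q = np.einsum("ri,ij,rj->r", d1, np.linalg.inv(sigma), d1)

    # e) n (xbar - mu)' S^{-1} (xbar - mu), with S the unbiased sample covariance
    centered = x - xbar[:, None, :]
    S = np.einsum("rni,rnj->rij", centered, centered) / (n - 1)
    d = xbar - mu
    t2 = n * np.einsum("ri,ri->r", d, np.linalg.solve(S, d[:, :, None])[:, :, 0])

    return {
        "max_cov_err": cov_err,
        "mean_q": q.mean(),
        "mean_t2": t2.mean(),
        "expected_t2": (n - 1) * p / (n - p - 2),
    }

result = simulate_problem1()
print(result)
```

With the default settings the simulated means should land close to p = 3 and to `expected_t2`, up to Monte Carlo error.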
2. Assume we are classifying objects into two groups, based on bivariate observations. The observation (x, y) has the density function $f_1(x, y)$ if it comes from group 1, and density function $f_2(x, y)$ if it comes from group 2. $f_1(x, y)$ is a uniform density on all (x, y) with $0 \le x \le 3$ and $0 \le y \le 3$, and zero otherwise. $f_2(x, y)$ is a bivariate normal density with expectation (1, 1) and the identity matrix as covariance matrix.

a) Assume you have a new object with observation (1/2, 1/2). Classify it into either group 1 or group 2, assuming equal misclassification costs and equal priors.

$f_1(x, y) = 1/9, \quad 0 \le x, y \le 3$.

$f_2(x, y) = \frac{1}{2\pi} e^{-\frac{1}{2}\left((x-1)^2 + (y-1)^2\right)}, \quad -\infty < x, y < \infty$.

Choose the distribution with the highest density at (½, ½).

$f_1(½, ½) = 1/9 \approx 0,1111$.
$f_2(½, ½) \approx 0,12395$.

The observation will be allocated to population 2.

b) Assume now that the prior probability of being in group 1 is twice the prior probability of being in group 2. Use this when classifying an object with observation (1, 1).

$p_1 = 2p_2 \Rightarrow p_1 = 2/3, \; p_2 = 1/3$.

Choose the distribution with the highest value of $p_i f_i(1, 1)$.

$p_1 f_1(1, 1) = (2/3)(1/9) \approx 0,07407$.
$p_2 f_2(1, 1) = (1/3) f_2(1, 1) \approx 0,05305$.

The observation will be allocated to population 1.

3. a) What is the idea behind principal component analysis? Explain how the sample principal components can be derived.

The main idea is to get a better understanding of the data. This is achieved by computing new uncorrelated variables (components) as linear combinations of the original variables. The coefficients (of length one) in the linear combinations are chosen so that the first component has maximum variance. The second component should have maximum variance given that it is uncorrelated with the first, and so on. Hopefully a small number of components contains most of the total variance, so that the data can be interpreted in a (much) smaller dimension than the original one (p).

The coefficients for the sample components are given by the eigenvectors of the sample covariance matrix S or the sample correlation matrix R. The coefficients for the first component are found in the eigenvector corresponding to the largest eigenvalue of S (or R), and so on.

b) Write down and explain the basic model used in factor analysis.
In factor analysis we assume that there is a number of unobservable factors that linearly affect the observed variables. We assume that the variance matrix of the observed variables can be written as $\Sigma = LL' + \Psi$, where L contains the factor loadings, that is, L tells how the underlying unobservable factors affect the observed variables. $\Psi$ contains the variance in the observed variables that is not explained by the factors.

c) Compare principal component analysis and factor analysis. What similarities are there and which are the major differences?

Both methods try to explain a data set in a lower dimension than the original. One way of computing factor loadings is to use the principal component method, i.e. using eigenvalues and eigenvectors of the covariance matrix as in PCA.

The major differences are:

- PCA is model free; it is just a rotation of the data. In FA we assume a well-defined model. If the model is not valid, the results of the analysis might be spurious.
- The direction of the transformation differs. In PCA we transform the original variables into new principal components. In FA we assume that unobservable factors are transformed into our measurable variables.
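The derivation of the sample principal components in a), and the principal component method for factor loadings mentioned in c), can be sketched in a few lines of NumPy. This is an illustration under assumptions, not part of the original solution: the function names `sample_pca` and `pc_factor_solution`, the simulated data set and the choice of m = 2 factors are hypothetical.

```python
import numpy as np

def sample_pca(X):
    """Sample principal components from the eigen-decomposition of S."""
    X = np.asarray(X, dtype=float)
    S = np.cov(X, rowvar=False)                # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = (X - X.mean(axis=0)) @ eigvecs    # component scores
    return S, eigvals, eigvecs, scores

def pc_factor_solution(S, eigvals, eigvecs, m):
    """Principal component method: L = [sqrt(l_1) e_1, ..., sqrt(l_m) e_m]."""
    L = eigvecs[:, :m] * np.sqrt(eigvals[:m])
    Psi = np.diag(np.diag(S - L @ L.T))        # specific variances on the diagonal
    return L, Psi

# Hypothetical data: 200 observations on 4 variables driven by 2 latent factors.
rng = np.random.default_rng(1)
factors = rng.normal(size=(200, 2))
true_loadings = np.array([[0.9, 0.0], [0.8, 0.3], [0.1, 0.9], [0.0, 0.7]])
X = factors @ true_loadings.T + 0.3 * rng.normal(size=(200, 4))

S, lam, E, Y = sample_pca(X)
L, Psi = pc_factor_solution(S, lam, E, m=2)
print("share of total variance per component:", lam / lam.sum())
```

The scores in `Y` are uncorrelated with variances equal to the eigenvalues, which is the "rotation of the data" view of PCA. The factor solution instead imposes the structure LL' + Ψ and reproduces the diagonal of S exactly, while the off-diagonal elements are only approximated.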
4. A food scientist was studying the effectiveness of phosphate salts in conjunction with vacuum packaging on the preservation of precooked ground turkey meat during long storage periods. In particular, the researcher wanted to compare five phosphate salt treatments to determine the particular salt treatment that would be most effective. The five phosphate salt treatments consisted of (1) a control (no phosphate salt), (2) sodium tripolyphosphate (STP) at a 0.3% level, (3) STP at a 0.5% level, (4) sodium ascorbate monophosphate (SAsMP) at a 0.3% level, and (5) SAsMP at a 0.5% level. Samples of cooked ground turkey meat were vacuum packaged using one of the five salt treatments, frozen and stored for 50 days at approximately -4 C. One complete replication of each of the five salt treatments was started on each of five different days, creating five blocked replicates for each set of treatments. Some of the variables measured included cooking loss (CKG_LOSS), pH (PH), moistness after cooking (MOIST), fat content (FAT), hexanal content (HEX), bathophenanthroline-chelatable iron contents (NONHEM), and the cooking time it took to reach an optimal cooking temperature (CKG_TIME).

An analysis using MINITAB gave the following printout.

Factor      Type    Levels   Values
TREATMENT   fixed        5   1; 2; 3; 4; 5
BLOCK       fixed        5   1; 2; 3; 4; 5

Analysis of Variance for CKG_LOSS
Source      DF        SS        MS       F       P
TREATMENT    4   106,724    26,681   13,08   0,000
BLOCK        4    57,516    14,379    7,05   0,002
Error       16    32,640     2,040
Total       24   196,880
S = 1,42829   R-Sq = 83,42%   R-Sq(adj) = 75,13%

Analysis of Variance for PH
Source      DF        SS        MS       F       P
TREATMENT    4   0,13782   0,03445    0,70   0,605
BLOCK        4   0,33106   0,08276    1,67   0,205
Error       16   0,79062   0,04941
Total       24   1,25950
S = 0,222293   R-Sq = 37,23%   R-Sq(adj) = 5,84%

Analysis of Variance for MOIST
Source      DF        SS        MS       F       P
TREATMENT    4    34,851     8,713    5,58   0,005
BLOCK        4     7,221     1,805    1,16   0,366
Error       16    24,968     1,561
Total       24    67,040
S = 1,24920   R-Sq = 62,76%   R-Sq(adj) = 44,13%

Analysis of Variance for FAT
Source      DF        SS        MS       F       P
TREATMENT    4    1,8494    0,4624    1,60   0,222
BLOCK        4    2,6673    0,6668    2,31   0,102
Error       16    4,6195    0,2887
Total       24    9,1362
S = 0,537328   R-Sq = 49,44%   R-Sq(adj) = 24,16%
Analysis of Variance for HEX
Source      DF        SS        MS       F       P
TREATMENT    4   2,48762   0,62191   17,00   0,000
BLOCK        4   1,38246   0,34562    9,45   0,000
Error       16   0,58522   0,03658
Total       24   4,45530
S = 0,191249   R-Sq = 86,86%   R-Sq(adj) = 80,30%

Analysis of Variance for NONHEM
Source      DF        SS        MS       F       P
TREATMENT    4   47,6344   11,9086   14,79   0,000
BLOCK        4   13,6384    3,4096    4,24   0,016
Error       16   12,8816    0,8051
Total       24   74,1544
S = 0,897274   R-Sq = 82,63%   R-Sq(adj) = 73,94%

Analysis of Variance for CKG_TIME
Source      DF        SS        MS       F       P
TREATMENT    4     7,040     1,760    0,62   0,654
BLOCK        4    48,240    12,060    4,25   0,016
Error       16    45,360     2,835
Total       24   100,640
S = 1,68375   R-Sq = 54,93%   R-Sq(adj) = 32,39%

MANOVA for TREATMENT
s = 4   m = 1,0   n = 4,0

Criterion          Test Statistic   Approx F   Num DF   Denom DF       P
Wilks'                    0,01487      2,962       28         37   0,001
Lawley-Hotelling             19,6       5,95       28         34   0,000
Pillai's                  1,83896      1,580       28         52   0,076
Roy's                       17,56

MANOVA for BLOCK
s = 4   m = 1,0   n = 4,0

Criterion          Test Statistic   Approx F   Num DF   Denom DF       P
Wilks'                    0,01014      3,444       28         37   0,000
Lawley-Hotelling            13,12      3,983       28         34   0,000
Pillai's                  2,35756      2,666       28         52   0,001
Roy's                     9,04354

a) What model has been used? What assumptions have to be fulfilled for the model to be valid?

A MANOVA model: $y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$, where $\alpha_i$ measures the effect of treatment i and $\beta_j$ the effect of block j. The random errors $\varepsilon_{ij}$ are assumed to be independent and normally distributed with mean vector zero and identical variance matrix. Since we have only one replicate of each treatment-block combination, we have to assume that there are no interactions between treatments and blocks.

b) What conclusions can be drawn from the output?

From the MANOVA part we see that when using Wilks' Λ as test variable, we can reject the hypothesis of no treatment effect (p = 0,001) as well as the hypothesis of no block effect (p = 0,000). When looking at the individual ANOVAs we see that there are significant treatment effects for the variables cooking loss, moisture, hexanal content and bathophenanthroline-chelatable iron contents.

c) What further analyses could be considered?

We would be interested in finding which treatments affect which variables. This could be found by computing simultaneous confidence intervals for the differences between the means of the different treatments. Another approach is to compute canonical variates and confidence intervals for their means.
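The P column of the univariate ANOVA tables can be cross-checked from the printed F ratios. The sketch below is an illustration, not part of the original solution. The function name `f_upper_tail` is hypothetical, and the closed form it uses holds only when both degrees of freedom are even, which is the case here: 4 for treatments and for blocks, and 24 - 4 - 4 = 16 for error. It uses the identity P(F > f) = I_x(d2/2, d1/2) with x = d2/(d2 + d1 f), where the regularized incomplete beta function with integer arguments reduces to a binomial tail sum.

```python
from math import comb

def f_upper_tail(f, d1, d2):
    """P(F > f) for F(d1, d2); closed form valid only for even d1 and d2."""
    if d1 % 2 or d2 % 2:
        raise ValueError("this closed form needs even degrees of freedom")
    a, b = d2 // 2, d1 // 2
    x = d2 / (d2 + d1 * f)
    n = a + b - 1
    # I_x(a, b) for integer a, b equals P(Binomial(n, x) >= a)
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

# (variable, source, printed F, printed P) taken from the printout,
# with decimal commas written as decimal points
printed = [
    ("MOIST", "TREATMENT", 5.58, 0.005),
    ("PH", "TREATMENT", 0.70, 0.605),
    ("HEX", "BLOCK", 9.45, 0.000),
]
for variable, source, f_ratio, p_printed in printed:
    p = f_upper_tail(f_ratio, 4, 16)
    print(f"{variable:6s} {source:10s} F = {f_ratio:5.2f}  p = {p:.4f}  (printed {p_printed:.3f})")
```

Because the printed F ratios are rounded to two decimals, the recomputed p-values can differ from the printout in the third decimal.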