Microarray Center
STATISTICAL DATA ANALYSIS IN EXCEL
Lecture 6: Some Advanced Topics
Dr. Petr Nazarov, 14-01-2013, petr.nazarov@crp-sante.lu
Statistical data analysis in Excel. 6. Some advanced topics
Correction for Multiple Comparisons
Please download the data from edu.sablab.net/data/xls
File: all_data.xls
MULTIPLE EXPERIMENTS
Correct Results and Errors
False negative: β error. False positive: α error.
Probability of at least one false positive in a multiple test with α = 0.05 per comparison: $1 - (0.95)^{\text{number of comparisons}}$
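The growth of this error probability can be tabulated with a minimal Python sketch. It assumes the comparisons are independent, which is what the formula above implies; the function name `familywise_error` is only illustrative.

```python
def familywise_error(alpha: float, n_comparisons: int) -> float:
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


# With alpha = 0.05 the chance of at least one false positive grows quickly.
for n in (1, 10, 100):
    print(n, round(familywise_error(0.05, n), 3))
```

With 100 comparisons a false positive is almost certain, which motivates the correction discussed next.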
MULTIPLE EXPERIMENTS
False Discovery Rate
False discovery rate (FDR): FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses (type I errors).

Conclusion \ Population condition   H0 is TRUE   H0 is FALSE   Total
Accept H0 (non-significant)         U            T             m - R
Reject H0 (significant)             V            S             R
Total                               m0           m - m0        m

$\mathrm{FDR} = E\left[\frac{V}{V+S}\right]$
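MULTIPLE EXPERIMENTS
False Discovery Rate
Assume we need to perform k = 100 comparisons, and select maximum FDR = α = 0.05.
$\mathrm{FDR} = E\left[\frac{V}{V+S}\right]$
The expected value of FDR is below α if
$P_{(k)} \le \frac{k}{m}\,\alpha \quad\Longleftrightarrow\quad \frac{m\,P_{(k)}}{k} \le \alpha$
In this inequality m is the total number of comparisons and $P_{(k)}$ is the k-th smallest p-value.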
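The quantity m·P(k)/k can be computed for every p-value as in the sketch below. The running minimum taken from the largest rank downwards is an added convention (it keeps the adjusted values monotone, as in the usual Benjamini-Hochberg adjustment) and is not stated on the slide; the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """FDR-adjusted p-values: m * P(k) / k, made monotone by a running minimum."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
print(bh_adjust(example))
```

A comparison is then called significant when its adjusted value is at or below the chosen α.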
MULTIPLE EXPERIMENTS
Example: Acute Lymphoblastic Leukemia
all_data.xls
Acute lymphoblastic leukemia (ALL) is a form of leukemia, or cancer of the white blood cells, characterized by excess lymphoblasts.
all_data.xls contains the results of full-transcript profiling for ALL patients and healthy donors using Affymetrix microarrays. The data were downloaded from the ArrayExpress repository and normalized. The expression values in the table are in log2 scale.
Let us analyze these data:
- Calculate the log-ratio (logFC) for each gene.
- Calculate the p-value based on a t-test for each gene.
- Perform the FDR-based adjustment of the p-value. Calculate the number of up- and down-regulated genes with FDR < 0.01.
- How would you take logFC into account? Example score: $\text{score} = -\log(\text{adj. p-value}) \cdot |\text{logFC}|$
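The per-gene steps can be sketched in Python as follows. Several choices here are assumptions rather than the lecture's method: an equal-variance two-sample t-test, a p-value obtained by numerically integrating Student's t density with Simpson's rule, base-10 logarithm in the score, and made-up expression values. Very small p-values from this integration are imprecise.

```python
import math
from statistics import mean, variance


def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 - 2 * integral of the t density from 0 to |t|."""
    coef = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    coef /= math.sqrt(df * math.pi)

    def pdf(x):
        return coef * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def analyze_gene(patients, donors):
    """Return (logFC, p-value) for one gene; inputs are log2 expression values."""
    log_fc = mean(patients) - mean(donors)  # difference of logs = log of ratio
    n1, n2 = len(patients), len(donors)
    pooled = ((n1 - 1) * variance(patients) + (n2 - 1) * variance(donors)) / (n1 + n2 - 2)
    t = log_fc / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return log_fc, t_two_sided_p(t, n1 + n2 - 2)


def score(adj_p, log_fc):
    """Example score combining significance and effect size (log10 assumed)."""
    return -math.log10(adj_p) * abs(log_fc)


# Hypothetical log2 expression values for a single gene.
log_fc, p = analyze_gene([8.0, 8.5, 9.0], [6.0, 6.5, 7.0])
print(log_fc, p)
```

In a full analysis the p-values of all genes would be FDR-adjusted before the score is computed.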
MULTIPLE EXPERIMENTS
[Figure: plot titled "tetraspanin 7"; value axis from 4.00 to 12.00]
Look for "tetraspanin 7" + leukemia in Google.
Results are never perfect.
Empirical Interval Estimation for Random Functions
INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS
Sum and Square of Normal Variables
Distribution of the sum or difference of 2 normal random variables: the sum/difference of 2 (or more) independent normal random variables is a normal random variable with mean equal to the sum/difference of the means and variance equal to the SUM of the variances of the components.
$E[x \pm y] = E[x] \pm E[y], \qquad \sigma^2_{x \pm y} = \sigma^2_x + \sigma^2_y$
Distribution of the sum of squares of k standard normal random variables: the sum of squares of k standard normal random variables follows a χ² distribution with k degrees of freedom.
$\text{if } x_1, \ldots, x_k \sim N(0, 1) \text{ then } \sum_{i=1}^{k} x_i^2 \sim \chi^2 \text{ with d.f.} = k$
What to do in more complex situations? x/y? log(x)?
INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS
Terrifying Theory
Try to solve analytically? Simplest case: E[x] = E[y] = 0
INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS
Practical Approach
Two rates were measured for a PCR experiment: experimental value (X) and control (Y). 5 replicates were performed for each. From previous experience we know that the error between replicates is normally distributed.
Q1: provide an interval estimation for the fold change X/Y (α = 0.05)
Q2: provide an interval estimation for the log fold change log2(X/Y)

#       Experiment   Control
1       215          83
2       253          75
3       198          62
4       225          91
5       240          70
Mean    226.2        76.2
StDev   21.39        11.26

Let us use a numerical simulation.
INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS
Practical Approach
1. Generate 2 sets of 65536 normal random variables with means and standard deviations corresponding to those of the experimental and control sets (Mean 226.2 and 76.2; StDev 21.39 and 11.26).
In Excel go to: Tools, Data Analysis, Random Number Generation.
If you do not have the Data Analysis tool, approximate the normal distribution by a sum of uniform random numbers:
$N(x, m, \sigma) = m + \sigma \left( \sum_{i=1}^{12} U_i(x) - 6 \right), \qquad U(x) = \text{RAND()}$
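The sum-of-uniforms approximation can be tried outside Excel with the short Python sketch below. The sum of 12 uniform numbers on [0, 1) has mean 6 and variance 1, so subtracting 6 gives an approximately standard normal value. The function name `approx_normal` and the seed are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev


def approx_normal(m, sigma, rng):
    """Approximate N(m, sigma) as m + sigma * (sum of 12 uniforms - 6)."""
    return m + sigma * (sum(rng.random() for _ in range(12)) - 6.0)


rng = random.Random(1)  # fixed seed so the run is repeatable
sample = [approx_normal(226.2, 21.39, rng) for _ in range(65536)]
print(round(mean(sample), 2), round(stdev(sample), 2))
```

The approximation cannot produce values further than 6σ from the mean, which is rarely a concern for this kind of simulation.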
INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS
Practical Approach
1. Generate 2 sets of 65536 normal random variables with means and standard deviations corresponding to those of the experimental and control sets (Mean 226.2 and 76.2; StDev 21.39 and 11.26). The means and standard deviations of the simulated sets (sim.m, sim.s) closely reproduce these values.
2. Build the target function. For Q1 build X/Y.
   X/Y.m   3.038998
   X/Y.s   0.566865
   min     -8.14098141
   max     7.71605
3. Study the target function. Calculate the summary, build a histogram.
   [Figure: histogram of the simulated X/Y; horizontal axis from 1 to 7, vertical axis from 0 to 14000]
4. If you would like to have a 95% interval, calculate the 2.5% and 97.5% percentiles. In Excel use the functions =PERCENTILE(data,0.025) and =PERCENTILE(data,0.975).
   X/Y ∈ [2.13, 4.33]
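The four steps can be reproduced with a short Python sketch. It assumes means of 226.2 and 76.2 and standard deviations of 21.39 and 11.26 for the experiment and control, and uses a linear-interpolation percentile intended to mirror Excel's PERCENTILE; the seed and helper name are arbitrary, so the simulated limits will differ slightly from run to run and from the Excel result.

```python
import random


def percentile(data, p):
    """Percentile with linear interpolation between order statistics (0 <= p <= 1)."""
    s = sorted(data)
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


rng = random.Random(2013)  # fixed seed for repeatability
n_sim = 65536
x = [rng.gauss(226.2, 21.39) for _ in range(n_sim)]  # simulated experiment
y = [rng.gauss(76.2, 11.26) for _ in range(n_sim)]   # simulated control
ratio = [a / b for a, b in zip(x, y)]                # target function X/Y

low, high = percentile(ratio, 0.025), percentile(ratio, 0.975)
print(round(low, 2), round(high, 2))
```

The same pattern works for any target function: replace the line that builds `ratio` and read off the percentiles.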
INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS
Practical Approach
What was the mistake in the previous case?
$\sigma_m = \frac{\sigma}{\sqrt{n}}$
There we spoke about the prediction interval of X/Y. Now let's produce the interval estimation for the mean X/Y, using the standard errors of the means instead of the standard deviations.

        Experiment   Control
Mean    226.2        76.2
StDev   9.57         5.03

X/Y.m   2.98047943
X/Y.s   0.23616818
min     2.01556098
max     4.31131109

[Figure: histogram of the simulated mean X/Y; horizontal axis from 2.1 to 4.3, vertical axis from 0 to 12000]
E[X/Y] ∈ [2.55, 3.48]
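A sketch of the corrected simulation follows. It assumes n = 5 replicates with replicate standard deviations 21.39 and 11.26, divides them by √n to get the standard errors, and simulates the ratio of the two means. `statistics.quantiles` with n=40 returns cut points at multiples of 2.5%, so the first and last entries bound the central 95%.

```python
import math
import random
from statistics import quantiles

n = 5                          # replicates per group
se_x = 21.39 / math.sqrt(n)    # standard error of the experimental mean
se_y = 11.26 / math.sqrt(n)    # standard error of the control mean

rng = random.Random(15)        # arbitrary fixed seed
ratios = [rng.gauss(226.2, se_x) / rng.gauss(76.2, se_y) for _ in range(65536)]

cuts = quantiles(ratios, n=40, method="inclusive")
ci_low, ci_high = cuts[0], cuts[-1]   # 2.5% and 97.5% percentiles
print(round(se_x, 2), round(se_y, 2), round(ci_low, 2), round(ci_high, 2))
```

The interval is much narrower than the prediction interval because the uncertainty of a mean shrinks with the number of replicates.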
INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS
Practical Approach
Q2: provide an interval estimation for the log fold change log2(X/Y)
Mean                 1.57105
Standard Deviation   0.113705
E[log2(X/Y)] ∈ [1.35, 1.80]

[Figure: histogram of the simulated log2(X/Y); horizontal axis from 1 to 2.1, vertical axis from 0 to 12000]

Percentile   Simulation   Normal
2.50%        1.3546       1.348
97.50%       1.7998       1.7939
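The comparison between simulated percentiles and the normal approximation can be sketched as below, under the same assumptions as before (means 226.2 and 76.2, replicate standard deviations 21.39 and 11.26, n = 5, base-2 logarithm). The seed is arbitrary, so the digits will not match the table exactly.

```python
import math
import random
from statistics import NormalDist, mean, quantiles, stdev

se_x = 21.39 / math.sqrt(5)   # standard error of the experimental mean
se_y = 11.26 / math.sqrt(5)   # standard error of the control mean

rng = random.Random(16)       # arbitrary fixed seed
log_fc = [math.log2(rng.gauss(226.2, se_x) / rng.gauss(76.2, se_y))
          for _ in range(65536)]

# Empirical 95% interval from the simulated values.
cuts = quantiles(log_fc, n=40, method="inclusive")
sim_low, sim_high = cuts[0], cuts[-1]

# Normal approximation: mean +/- z * standard deviation.
m, s = mean(log_fc), stdev(log_fc)
z = NormalDist().inv_cdf(0.975)
norm_low, norm_high = m - z * s, m + z * s

print(round(sim_low, 3), round(sim_high, 3))
print(round(norm_low, 3), round(norm_high, 3))
```

The two intervals nearly coincide, suggesting that the logarithm of the ratio is much closer to normal than the ratio itself.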
Goodness of Fit and Independence
TEST OF GOODNESS OF FIT
Multinomial Population
Multinomial population: a population in which each element is assigned to one and only one of several categories. The multinomial distribution extends the binomial distribution from two to three or more outcomes.
Contingency table = crosstabulation: contingency tables or crosstabulations are used to record, summarize and analyze the relationship between two or more (usually categorical) variables.
The new treatment for a disease is tested on 200 patients. The outcomes are classified as:
A: the patient is completely treated
B: the disease transforms into a chronic form
C: the treatment is unsuccessful
In parallel, 100 patients treated with standard methods are observed.

Category   Experimental   Control
A          94             38
B          42             28
C          64             34
Sum        200            100
TEST OF GOODNESS OF FIT
Goodness of Fit
Goodness of fit test: a statistical test conducted to determine whether to reject a hypothesized probability distribution for a population.
Model: our assumption concerning the distribution, which we would like to test.
Observed frequency: frequency distribution for experimentally observed data, $f_i$.
Expected frequency: frequency distribution which we would expect from our model, $e_i$.
Hypotheses for the test:
H0: the population follows a multinomial distribution with the probabilities specified by the model
Ha: the population does not follow the model
Test statistic for goodness of fit:
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
χ² has k - 1 degrees of freedom. The expected frequency must be at least 5 in each category!
TEST OF GOODNESS OF FIT
Example
The new treatment for a disease is tested on 200 patients. The outcomes are classified as:
A: the patient is completely treated
B: the disease transforms into a chronic form
C: the treatment is unsuccessful
In parallel, 100 patients treated with standard methods are observed.

1. Select the model and calculate the expected frequencies. Let's use the control group (classical treatment) as a model, then:

Category   Control   Model for control   Expected freq., e   Experimental freq., f
A          38        0.38                76                  94
B          28        0.28                56                  42
C          34        0.34                68                  64
Sum        100       1                   200                 200

2. Compare the expected frequencies with the experimental ones and build χ²:
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$

Category   (f-e)²/e
A          4.263
B          3.500
C          0.235
Chi2       7.998

3. Calculate the p-value for χ² with d.f. = k - 1.
In Excel: =CHISQ.DIST.RT(χ², d.f.) for the right-tail p-value from the statistic, or =CHISQ.TEST(f, e) directly from the observed and expected frequencies.
p-value = 0.018, reject H0
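The three steps can be checked with a few lines of Python. The sketch assumes the frequencies 94, 42, 64 (experimental) and 38, 28, 34 (control), and uses the closed form exp(-χ²/2) for the right tail of the χ² distribution, which holds only for d.f. = 2.

```python
import math

observed = [94, 42, 64]   # experimental frequencies f (A, B, C)
control = [38, 28, 34]    # control frequencies used as the model

# Step 1: model probabilities from the control group, scaled to 200 patients.
n_obs = sum(observed)
expected = [c / sum(control) * n_obs for c in control]

# Step 2: per-category contributions and the chi-square statistic.
contributions = [(f - e) ** 2 / e for f, e in zip(observed, expected)]
chi2 = sum(contributions)

# Step 3: right-tail p-value; exp(-chi2 / 2) is exact only for d.f. = 2.
df = len(observed) - 1
assert df == 2
p_value = math.exp(-chi2 / 2)

print([round(c, 3) for c in contributions], round(chi2, 3), round(p_value, 3))
```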
TEST OF INDEPENDENCE
Goodness of Fit for Independence Test: Example
Alber's Brewery manufactures and distributes three types of beer: white, regular, and dark. In an analysis of the market segments for the three beers, the firm's market research group raised the question of whether preferences for the three beers differ among male and female beer drinkers. If beer preference is independent of the gender of the beer drinker, one advertising campaign will be initiated for all of Alber's beers. However, if beer preference depends on the gender of the beer drinker, the firm will tailor its promotions to different target markets.
beer.xls
H0: Beer preference is independent of the gender of the beer drinker
Ha: Beer preference is not independent of the gender of the beer drinker

sex\beer   White   Regular   Dark   Total
Male       20      40        20     80
Female     30      30        10     70
Total      50      70        30     150
TEST OF INDEPENDENCE
Goodness of Fit for Independence Test: Example
1. Build the model assuming independence.

sex\beer   White   Regular   Dark   Total
Male       20      40        20     80
Female     30      30        10     70
Total      50      70        30     150

           White    Regular   Dark     Total
Model      0.3333   0.4667    0.2000   1

2. Transfer the model into expected frequencies, multiplying the model value by the number in each group.
$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sample Size}}$

sex\beer   White   Regular   Dark    Total
Male       26.67   37.33     16.00   80
Female     23.33   32.67     14.00   70
Total      50      70        30      150

3. Build the χ² statistic.
$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(f_{ij} - e_{ij})^2}{e_{ij}} = 6.12$
This follows a χ² distribution with d.f. = (n - 1)(m - 1), provided that the expected frequencies are 5 or more for all categories.
4. Calculate the p-value.
p-value = 0.047, reject H0
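The same calculation in Python, as a minimal sketch: it assumes the observed counts 20, 40, 20 (male) and 30, 30, 10 (female), builds the expected frequencies from the row and column totals, and again relies on the closed form exp(-χ²/2), which is valid only because d.f. = (2 - 1)(3 - 1) = 2 here.

```python
import math

# Observed counts: rows = Male, Female; columns = White, Regular, Dark.
table = [[20, 40, 20],
         [30, 30, 10]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
sample_size = sum(row_totals)

# Expected frequency under independence: row total * column total / sample size.
expected = [[r * c / sample_size for c in col_totals] for r in row_totals]

chi2 = sum((f - e) ** 2 / e
           for f_row, e_row in zip(table, expected)
           for f, e in zip(f_row, e_row))

df = (len(table) - 1) * (len(table[0]) - 1)
assert df == 2                   # the closed form below requires d.f. = 2
p_value = math.exp(-chi2 / 2)

print(round(chi2, 2), round(p_value, 3))
```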
TEST FOR CONTINUOUS DISTRIBUTIONS
Test for Normality: Example
Chemline hires approximately 400 new employees annually for its four plants. The personnel director asks whether a normal distribution applies for the population of aptitude test scores. If such a distribution can be used, the distribution would be helpful in evaluating specific test scores; that is, scores in the upper 20%, lower 40%, and so on, could be identified quickly. Hence, we want to test the null hypothesis that the population of test scores has a normal distribution. The study will be based on 50 results.
chemline.xls

Aptitude test scores:
71 86 56 61 65 60 63 76 69 56
55 79 56 74 93 82 80 90 80 73
85 62 64 54 54 65 54 63 73 58
77 56 65 76 64 61 84 70 53 79
79 61 62 61 65 66 70 68 76 71

Mean                 68.42
Standard Deviation   10.4141
Sample Variance      108.4527
Count                50

H0: The population of test scores has a normal distribution with mean 68.42 and standard deviation 10.41
Ha: The population does not have the mentioned distribution
TEST FOR CONTINUOUS DISTRIBUTIONS
Test for Normality: Example
chemline.xls
Mean                 68.42
Standard Deviation   10.4141
Sample Variance      108.4527
Count                50

Bin (upper limit)   Observed frequency   Expected frequency
55.1                5                    5
59.68               5                    5
63.01               9                    5
65.82               6                    5
68.42               2                    5
71.02               5                    5
73.83               2                    5
77.16               5                    5
81.74               5                    5
More                6                    5
Total               50                   50

$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i} = 7.2$
The statistic follows a χ² distribution with d.f. = k - p - 1, where k is the number of bins and p is the number of estimated parameters.
p = 2 (includes mean and variance), so d.f. = 10 - 2 - 1 = 7
p-value = 0.41, cannot reject H0
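The binning and the test can be reproduced with the Python sketch below. It assumes the 50 scores as listed for the Chemline example (with mean 68.42) and takes the nine upper bin limits from the table rather than recomputing them, since exact normal quantiles differ slightly from the rounded limits and would move the two scores of 63 into the next bin. The right-tail probability uses a series for the regularized incomplete gamma function; the helper name `chi2_sf` is an illustrative choice.

```python
import bisect
import math

scores = [71, 86, 56, 61, 65, 60, 63, 76, 69, 56,
          55, 79, 56, 74, 93, 82, 80, 90, 80, 73,
          85, 62, 64, 54, 54, 65, 54, 63, 73, 58,
          77, 56, 65, 76, 64, 61, 84, 70, 53, 79,
          79, 61, 62, 61, 65, 66, 70, 68, 76, 71]

# Upper limits of the first 9 bins (10 equal-probability bins under H0).
edges = [55.1, 59.68, 63.01, 65.82, 68.42, 71.02, 73.83, 77.16, 81.74]


def chi2_sf(x, df):
    """Right-tail probability of chi-square via the lower incomplete gamma series."""
    a, z = df / 2, x / 2
    if z <= 0:
        return 1.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return 1.0 - lower


# Count the scores falling into each bin.
observed = [0] * (len(edges) + 1)
for value in scores:
    observed[bisect.bisect_left(edges, value)] += 1

expected = len(scores) / len(observed)             # 5 per bin
chi2 = sum((f - expected) ** 2 / expected for f in observed)
df = len(observed) - 2 - 1                         # two estimated parameters
p_value = chi2_sf(chi2, df)

print(observed, round(chi2, 2), df, round(p_value, 2))
```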
QUESTIONS?
Thank you for your attention