STATISTICAL DATA ANALYSIS IN EXCEL




Microarray Center
STATISTICAL DATA ANALYSIS IN EXCEL
Lecture 6: Some Advanced Topics
Dr. Petr Nazarov, 14-01-2013, petr.nazarov@crp-sante.lu
Statistical data analysis in Excel. 6. Some advanced topics

Correction for Multiple Comparisons
Please download the data from edu.sablab.net/data/xls: all_data.xls

MULTIPLE EXPERIMENTS: Correct Results and Errors
False negative: β error. False positive: α error.
Probability of at least one error in a multiple test (α = 0.05 per comparison): $1 - (0.95)^{\text{number of comparisons}}$
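To see how quickly this probability grows, here is a minimal sketch that assumes independent comparisons, each tested at α = 0.05. The function name `familywise_error` is chosen for illustration and does not come from the lecture.

```python
def familywise_error(n_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_comparisons


for n in (1, 10, 100):
    # One test gives 5%; ten tests already give about 40%.
    print(n, round(familywise_error(n), 3))
```

With 100 comparisons a false positive is practically guaranteed, which is the motivation for the correction discussed next.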

MULTIPLE EXPERIMENTS: False Discovery Rate
False discovery rate (FDR): FDR control is a statistical method used in multiple hypothesis testing to correct for multiple comparisons. In a list of rejected hypotheses, FDR controls the expected proportion of incorrectly rejected null hypotheses (type I errors).

Conclusion \ Population condition    H0 is TRUE    H0 is FALSE    Total
Accept H0 (non-significant)          U             T              m - R
Reject H0 (significant)              V             S              R
Total                                m0            m - m0         m

$FDR = E\left[\frac{V}{V + S}\right]$

MULTIPLE EXPERIMENTS: False Discovery Rate
Assume we need to perform m = 100 comparisons, and select maximum FDR = α = 0.05.
$FDR = E\left[\frac{V}{V + S}\right]$
Sort the p-values in increasing order and let $P_{(k)}$ be the k-th smallest one. The expected value for FDR < α if
$P_{(k)} \le \frac{k}{m}\,\alpha$, or equivalently $\frac{m\,P_{(k)}}{k} \le \alpha$
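The adjustment $m P_{(k)} / k$ can be sketched in a few lines. This is an illustrative implementation, not the spreadsheet used in the lecture: the function name `bh_adjust` and the four example p-values are made up, and the running minimum (which keeps adjusted values monotone in the rank) is the usual step-up refinement rather than something stated on the slide.

```python
def bh_adjust(p_values):
    """FDR-adjusted p-values: m * P(k) / k, made monotone from the largest rank down."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
print([round(q, 4) for q in bh_adjust(example)])  # [0.02, 0.04, 0.04, 0.02]
```

A gene is then called significant at the chosen FDR level when its adjusted p-value is at most α.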

MULTIPLE EXPERIMENTS: Example: Acute Lymphoblastic Leukemia (all_data.xls)
Acute lymphoblastic leukemia (ALL) is a form of leukemia, or cancer of the white blood cells, characterized by excess lymphoblasts.
all_data.xls contains the results of full-transcript profiling for patients and healthy donors using Affymetrix microarrays. The data were downloaded from the ArrayExpress repository and normalized. The expression values in the table are in log2 scale.
Let us analyze these data:
- Calculate the log-ratio (logFC) for each gene
- Calculate the p-value based on the t-test for each gene
- Perform the FDR-based adjustment of the p-value
- Calculate the number of up- and down-regulated genes with FDR < 0.01
How would you take logFC into account? Example score:
$score = -\log(\text{adj.p.value}) \cdot |logFC|$

MULTIPLE EXPERIMENTS: tetraspanin 7
[Figure: expression values for tetraspanin 7; vertical axis from 4.00 to 12.00]
Look for "tetraspanin 7" + leukemia in Google.
Results are never perfect.

Empirical Interval Estimation for Random Functions

INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS: Sum and Square of Normal Variables
Distribution of the sum or difference of 2 random variables
The sum/difference of 2 (or more) independent random variables is a random variable with mean equal to the sum/difference of the means and variance equal to the SUM of the variances of the components.
$E[x \pm y] = E[x] \pm E[y]$
$\sigma^2_{x \pm y} = \sigma^2_x + \sigma^2_y$
If x and y follow normal distributions, x ± y follows a normal distribution as well.

Distribution of the sum of squares of k standard normal random variables
The sum of squares of k standard normal random variables is a χ² variable with k degrees of freedom:
if $x_1, \dots, x_k$ are standard normal, then $\sum_{i=1}^{k} x_i^2 \sim \chi^2$ with d.f. = k

What to do in more complex situations, such as x/y or log(x)?
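The rule that variances add even for a difference is easy to check numerically. The sketch below uses made-up parameters (means 10 and 4, standard deviations 3 and 4) and assumes the two variables are independent; the expected result is a mean near 10 - 4 = 6 and a variance near 3² + 4² = 25.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
n = 100_000

# Two independent normal variables with hypothetical parameters.
x = [rng.gauss(10.0, 3.0) for _ in range(n)]
y = [rng.gauss(4.0, 4.0) for _ in range(n)]
diff = [a - b for a, b in zip(x, y)]

diff_mean = statistics.fmean(diff)
diff_var = statistics.pvariance(diff)
print(round(diff_mean, 2), round(diff_var, 2))
```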

INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS: Terrifying Theory
Try to solve analytically? Simplest case: E[x] = E[y] = 0

INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS: Practical Approach
Two rates were measured for a PCR experiment: experimental value (X) and control (Y). 5 replicates were performed for each. From previous experience we know that the error between replicates is normally distributed.
Q1: provide an interval estimation for the fold change X/Y (α = 0.05)
Q2: provide an interval estimation for the log fold change log2(X/Y)

#        Experiment    Control
1        215           83
2        253           75
3        198           62
4        225           91
5        240           70
Mean     226.2         76.2
StDev    21.39         11.26

Let us use a numerical simulation.
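As a quick check of the summary rows, the following sketch recomputes the mean and the sample standard deviation, taking the replicates to be 215, 253, 198, 225, 240 (experiment) and 83, 75, 62, 91, 70 (control), the readings that reproduce the stated means of 226.2 and 76.2. It mirrors what Excel's AVERAGE and STDEV would return.

```python
import statistics

experiment = [215, 253, 198, 225, 240]
control = [83, 75, 62, 91, 70]

for name, values in (("Experiment", experiment), ("Control", control)):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    print(name, round(mean, 1), round(sd, 2))
```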

INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS: Practical Approach
1. Generate 2 sets of 65536 normal random variables with means and standard deviations corresponding to those of the experimental and control sets.

         Experiment    Control
Mean     226.2         76.2
StDev    21.39         11.26

In Excel go: Tools → Data Analysis → Random Number Generation.
If you do not have the Data Analysis tool, approximate the normal distribution by a sum of uniform variables:
$N(m, \sigma) \approx m + \sigma \left( \sum_{i=1}^{12} U_i - 6 \right)$, where each $U_i$ = RAND()
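The sum-of-uniforms trick works because 12 uniform numbers on [0, 1] add up to a variable with mean 6 and variance 1. A minimal sketch, with `rng.random()` standing in for Excel's RAND() and `approx_normal` as a name invented here:

```python
import random
import statistics


def approx_normal(mean, sd, rng):
    """Approximate a normal draw as mean + sd * (sum of 12 uniforms - 6)."""
    return mean + sd * (sum(rng.random() for _ in range(12)) - 6.0)


rng = random.Random(0)
sample = [approx_normal(226.2, 21.39, rng) for _ in range(65536)]
print(round(statistics.fmean(sample), 1), round(statistics.stdev(sample), 1))
```

The approximation cannot produce values further than 6 standard deviations from the mean, which is rarely a limitation for this kind of simulation.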

INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS: Practical Approach
1. Generate 2 sets of 65536 normal random variables with means and standard deviations corresponding to those of the experimental and control sets.

         Experiment    Control
Mean     226.2         76.2
StDev    21.39         11.26
sim.m    226.088799    76.283
sim.s    21.37965      11.2885

2. Build the target function. For Q1 build X/Y.

X/Y.m    3.038998
X/Y.s    0.566865
min      -8.14098141
max      7.71605

3. Study the target function. Calculate the summary, build the histogram.
[Histogram of simulated X/Y values; horizontal axis from 1 to 7, vertical axis: count]
4. If you would like to have a 95% interval, calculate the 2.5% and 97.5% percentiles. In Excel use the function =PERCENTILE(data,0.025) for the lower bound and =PERCENTILE(data,0.975) for the upper bound.
X/Y ∈ [2.13, 4.33]
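Steps 1 to 4 can be sketched outside Excel as follows. This is an illustration under the lecture's assumptions (independent, normally distributed X and Y); the function name `ratio_interval`, the seed, and the simple order-statistic percentile are choices made here, so the bounds will differ slightly from Excel's interpolating PERCENTILE.

```python
import random


def ratio_interval(mx, sx, my, sy, n=65536, seed=0):
    """Simulate X/Y and return its 2.5% and 97.5% percentiles."""
    rng = random.Random(seed)
    # Steps 1 and 2: draw X and Y, build the target function X/Y.
    ratios = sorted(rng.gauss(mx, sx) / rng.gauss(my, sy) for _ in range(n))
    # Step 4: read the percentiles off the sorted sample.
    lower = ratios[int(0.025 * (n - 1))]
    upper = ratios[int(0.975 * (n - 1))]
    return lower, upper


lower, upper = ratio_interval(226.2, 21.39, 76.2, 11.26)
print(round(lower, 2), round(upper, 2))  # close to the interval [2.13, 4.33]
```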

INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS: Practical Approach
What was the mistake in the previous case?
$\sigma_m = \frac{\sigma}{\sqrt{n}}$
There we spoke about the prediction interval of X/Y. Now let's produce the interval estimation for the mean X/Y.

         Experiment    Control
Mean     226.2         76.2
StDev    9.57          5.03

X/Y.m    2.98047943
X/Y.s    0.23616818
min      2.01556098
max      4.31131109

[Histogram of simulated mean X/Y values; horizontal axis from 2.1 to 4.3, vertical axis: count]
E[X/Y] ∈ [2.55, 3.48]
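The correction amounts to replacing each standard deviation by the standard error of the mean before simulating. A hedged sketch, assuming the replicate values 215, 253, 198, 225, 240 and 83, 75, 62, 91, 70; the helper names are invented for this example:

```python
import math
import random
import statistics

experiment = [215, 253, 198, 225, 240]
control = [83, 75, 62, 91, 70]


def standard_error(values):
    """Standard error of the mean: sigma / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def mean_ratio_interval(x, y, n=65536, seed=0):
    """95% percentile interval for the ratio of the two means."""
    rng = random.Random(seed)
    mx, my = statistics.mean(x), statistics.mean(y)
    sx, sy = standard_error(x), standard_error(y)
    ratios = sorted(rng.gauss(mx, sx) / rng.gauss(my, sy) for _ in range(n))
    return ratios[int(0.025 * (n - 1))], ratios[int(0.975 * (n - 1))]


print(round(standard_error(experiment), 2), round(standard_error(control), 2))
print(tuple(round(v, 2) for v in mean_ratio_interval(experiment, control)))
```

The interval is noticeably narrower than the prediction interval because it describes the uncertainty of the mean ratio, not of a single future measurement.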

INTERVAL ESTIMATIONS FOR RANDOM FUNCTIONS: Practical Approach
Q2: provide an interval estimation for the log fold change log2(X/Y)

Mean                  1.57105
Standard Deviation    0.113705

E[log2(X/Y)] ∈ [1.35, 1.80]
[Histogram of simulated log2(X/Y) values; horizontal axis from 1 to 2.1, vertical axis: count]

Percentile    Simulation    Normal
2.50%         1.3546        1.348
97.50%        1.7998        1.7939
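The comparison between the simulated percentiles and a normal approximation (mean ± 1.96 standard deviations) can be reproduced with the sketch below. It assumes the standard errors 9.57 and 5.03 derived earlier in the lecture; the function name and seed are illustrative. With these parameters X and Y are many standard errors away from zero, so the logarithm of a negative ratio is not a practical concern.

```python
import math
import random
import statistics


def log2_ratio_summary(mx, sx, my, sy, n=65536, seed=0):
    """Return (simulated 95% interval, normal-approximation interval) for log2(X/Y)."""
    rng = random.Random(seed)
    values = sorted(math.log2(rng.gauss(mx, sx) / rng.gauss(my, sy)) for _ in range(n))
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    simulated = (values[int(0.025 * (n - 1))], values[int(0.975 * (n - 1))])
    normal = (mean - 1.96 * sd, mean + 1.96 * sd)
    return simulated, normal


simulated, normal = log2_ratio_summary(226.2, 9.57, 76.2, 5.03)
print([round(v, 3) for v in simulated], [round(v, 3) for v in normal])
```

The two intervals nearly coincide, which suggests that on the log scale the fold change is close to normally distributed.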

Goodness of Fit and Independence

TEST OF GOODNESS OF FIT: Multinomial Population
Multinomial population: a population in which each element is assigned to one and only one of several categories. The multinomial distribution extends the binomial distribution from two to three or more outcomes.
Contingency table = crosstabulation: contingency tables or crosstabulations are used to record, summarize and analyze the relationship between two or more (usually categorical) variables.

The new treatment for a disease is tested on 200 patients. The outcomes are classified as:
A: patient is completely treated
B: disease transforms into a chronic form
C: treatment is unsuccessful
In parallel, 100 patients treated with standard methods are observed.

Category    Experimental    Control
A           94              38
B           42              28
C           64              34
Sum         200             100

TEST OF GOODNESS OF FIT: Goodness of Fit
Goodness of fit test: a statistical test conducted to determine whether to reject a hypothesized probability distribution for a population.
Model: our assumption concerning the distribution, which we would like to test.
Observed frequency: frequency distribution for experimentally observed data, $f_i$.
Expected frequency: frequency distribution which we would expect from our model, $e_i$.
Hypotheses for the test:
H0: the population follows a multinomial distribution with the probabilities specified by the model
Ha: the population does not follow the model
Test statistic for goodness of fit:
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$
χ² has k - 1 degrees of freedom. The expected frequency must be at least 5 in each category!

TEST OF GOODNESS OF FIT: Example
Data: the treatment example with 200 experimental patients and 100 control patients.

Category    Experimental    Control
A           94              38
B           42              28
C           64              34
Sum         200             100

1. Select the model and calculate expected frequencies. Let's use the control group (classical treatment) as a model, then:

Category    Control freq.    Model for control    Expected freq., e    Experimental freq., f
A           38               0.38                 76                   94
B           28               0.28                 56                   42
C           34               0.34                 68                   64
Sum         100              1                    200                  200

2. Compare expected frequencies with the experimental ones and build χ²:
$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$

Category    (f-e)²/e
A           4.263
B           3.500
C           0.235
Chi2        7.998

3. Calculate the p-value for χ² with d.f. = k - 1. In Excel: =CHISQ.DIST.RT(χ², d.f.) or directly =CHISQ.TEST(f, e).
p-value = 0.018, reject H0
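The three steps can be traced in a short sketch. It uses the counts from the table (94, 42, 64 observed; 38, 28, 34 in the control group) and relies on the fact that for exactly 2 degrees of freedom the upper tail of χ² equals exp(-χ²/2); for other degrees of freedom a statistics library or Excel's CHISQ.DIST.RT would be needed.

```python
import math

observed = [94, 42, 64]   # experimental frequencies f
control = [38, 28, 34]    # control group, used as the model

# Step 1: expected frequencies e = model probability * experimental sample size.
n = sum(observed)
expected = [n * c / sum(control) for c in control]

# Step 2: chi-square statistic.
chi2 = sum((f - e) ** 2 / e for f, e in zip(observed, expected))

# Step 3: p-value with d.f. = k - 1 = 2 (closed form valid only for 2 d.f.).
p_value = math.exp(-chi2 / 2)

print(expected, round(chi2, 3), round(p_value, 3))
```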

TEST OF INDEPENDENCE: Goodness of Fit for Independence Test: Example (beer.xls)
Alber's Brewery manufactures and distributes three types of beer: white, regular, and dark. In an analysis of the market segments for the three beers, the firm's market research group raised the question of whether preferences for the three beers differ among male and female beer drinkers. If beer preference is independent of the gender of the beer drinker, one advertising campaign will be initiated for all of Alber's beers. However, if beer preference depends on the gender of the beer drinker, the firm will tailor its promotions to different target markets.
H0: Beer preference is independent of the gender of the beer drinker
Ha: Beer preference is not independent of the gender of the beer drinker

sex\beer    White    Regular    Dark    Total
Male        20       40         20      80
Female      30       30         10      70
Total       50       70         30      150

TEST OF INDEPENDENCE: Goodness of Fit for Independence Test: Example
1. Build the model assuming independence.

sex\beer    White    Regular    Dark    Total
Male        20       40         20      80
Female      30       30         10      70
Total       50       70         30      150

            White     Regular    Dark      Total
Model       0.3333    0.4667     0.2000    1

2. Transfer the model into expected frequencies, multiplying the model value by the number in the group:
$e_{ij} = \frac{(\text{Row } i \text{ Total})(\text{Column } j \text{ Total})}{\text{Sample Size}}$

sex\beer    White    Regular    Dark     Total
Male        26.67    37.33      16.00    80
Female      23.33    32.67      14.00    70
Total       50       70         30       150

3. Build the χ² statistic:
$\chi^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{(f_{ij} - e_{ij})^2}{e_{ij}}$, χ² = 6.12
This follows a χ² distribution with d.f. = (n - 1)(m - 1), provided that the expected frequencies are 5 or more for all categories.
4. Calculate the p-value.
p-value = 0.047, reject H0
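The same four steps, written out as a sketch for the beer table. Variable names are illustrative; the p-value again uses the closed form exp(-χ²/2), which applies here only because d.f. = (2 - 1)(3 - 1) = 2.

```python
import math

# Observed counts: rows = Male, Female; columns = White, Regular, Dark.
observed = [[20, 40, 20],
            [30, 30, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Steps 1 and 2: expected frequency = row total * column total / sample size.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Step 3: chi-square statistic over all cells.
chi2 = sum((f - e) ** 2 / e
           for f_row, e_row in zip(observed, expected)
           for f, e in zip(f_row, e_row))

# Step 4: p-value for 2 degrees of freedom.
df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = math.exp(-chi2 / 2)

print([[round(e, 2) for e in row] for row in expected], round(chi2, 2), df, round(p_value, 3))
```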

TEST FOR CONTINUOUS DISTRIBUTIONS: Test for Normality: Example (chemline.xls)
Chemline hires approximately 400 new employees annually for its four plants. The personnel director asks whether a normal distribution applies for the population of aptitude test scores. If such a distribution can be used, the distribution would be helpful in evaluating specific test scores; that is, scores in the upper 20%, lower 40%, and so on, could be identified quickly. Hence, we want to test the null hypothesis that the population of test scores has a normal distribution. The study will be based on 50 results.

Aptitude test scores:
71 86 56 61 65 60 63 76 69 56
55 79 56 74 93 82 80 90 80 73
85 62 64 54 54 65 54 63 73 58
77 56 65 76 64 61 84 70 53 79
79 61 62 61 65 66 70 68 76 71

Mean                  68.42
Standard Deviation    10.4141
Sample Variance       108.4527
Count                 50

H0: the population of test scores has a normal distribution with mean 68.42 and standard deviation 10.41
Ha: the population does not have the mentioned distribution

TEST FOR CONTINUOUS DISTRIBUTIONS: Test for Normality: Example (chemline.xls)

Mean                  68.42
Standard Deviation    10.4141
Sample Variance       108.4527
Count                 50

Bin (upper limit)    Observed frequency    Expected frequency
55.10                5                     5
59.68                5                     5
63.01                9                     5
65.82                6                     5
68.42                2                     5
71.02                5                     5
73.83                2                     5
77.16                5                     5
81.74                5                     5
More                 6                     5
Total                50                    50

$\chi^2 = \sum_{i=1}^{k} \frac{(f_i - e_i)^2}{e_i}$, χ² = 7.2
This follows a χ² distribution with d.f. = n - p - 1, where n is the number of bins and p is the number of estimated parameters.
p = 2 (includes mean and variance), so d.f. = 10 - 2 - 1 = 7
p-value = 0.41, cannot reject H0
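The binning and the test statistic can be reproduced with the sketch below. It takes the 50 scores with the damaged entries read as 82, 62 and 62 (the readings that give the stated mean of 68.42) and uses the bin limits from the table rather than recomputing them. The p-value uses a closed-form expression for the χ² upper tail that holds for odd degrees of freedom; the function name `chi2_sf_odd` is invented here, and in Excel the same value would come from =CHISQ.DIST.RT(7.2, 7).

```python
import math
from bisect import bisect_left

scores = [71, 86, 56, 61, 65, 60, 63, 76, 69, 56,
          55, 79, 56, 74, 93, 82, 80, 90, 80, 73,
          85, 62, 64, 54, 54, 65, 54, 63, 73, 58,
          77, 56, 65, 76, 64, 61, 84, 70, 53, 79,
          79, 61, 62, 61, 65, 66, 70, 68, 76, 71]

# Upper limits of the first nine bins; the tenth bin ("More") is open-ended.
limits = [55.10, 59.68, 63.01, 65.82, 68.42, 71.02, 73.83, 77.16, 81.74]

observed = [0] * (len(limits) + 1)
for s in scores:
    observed[bisect_left(limits, s)] += 1   # a score goes to the first bin whose limit is >= score

expected = len(scores) / len(observed)      # ten equally probable bins: 5 per bin
chi2 = sum((f - expected) ** 2 / expected for f in observed)


def chi2_sf_odd(x, df):
    """Upper-tail probability of chi-square for odd degrees of freedom."""
    z = math.sqrt(x)
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    term, total = z, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(z / math.sqrt(2)) + 2 * density * total


df = len(observed) - 2 - 1                  # two estimated parameters: mean and variance
p_value = chi2_sf_odd(chi2, df)
print(observed, round(chi2, 1), df, round(p_value, 2))
```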

QUESTIONS?
Thank you for your attention