Lecture 13. Clinical studies and basic biostatistics
ME330.884 Mass spectrometry in an omics world
December 10, 2012
Stefani Thomas, Ph.D.

Statistics and biostatistics
- Statistics: collection, organization, analysis, and interpretation of numerical data
  - Objective: make an inference about a population based on information contained in a sample
- Biostatistics: application of statistical methods to medical and biological problems

Role of statistics in decision-making processes
- Analysis of data from clinical trials to determine efficacy of new drugs
- Should a mastectomy always be recommended to a patient with breast cancer?
- What factors increase the risk that an individual will develop coronary heart disease?

Numbers are more precise than words
- "There are three kinds of lies: lies, damned lies, and statistics." Benjamin Disraeli (British Prime Minister, 1874-1880)
- "It is easy to lie with statistics, but it is easier to lie without them." Professor Frederick Mosteller (founding chairman of Harvard's statistics department, 1956)

1. Types of data (variables)
2. Descriptive statistics/numerical summary measures
3. Measures of dispersion/variability
4. Normal distribution and confidence intervals
5. Hypothesis testing
6. Correlation and regression analysis
7. Analysis of variance (ANOVA)
8. Experimental design

1. Types of data (variables)

Categorical data
- Nominal data: categories without a natural order
  - e.g., sex, race, country
- Ordinal data: categories with a natural order
  - e.g., socioeconomic status (low, middle, high); type of bone break (hairline, simple, compound)
- Numbers can be assigned to specific values, but the value of the numbers is arbitrary
- Percentages and proportions are used to analyze categorical data

Discrete data
- Ordered numerical data restricted to integer values
  - e.g., number of deaths due to AIDS in 2011; eggs laid per chicken; number of new cases of tuberculosis reported in the U.S. during a one-year period
- Both ordering and magnitude are important
- Numbers represent actual measurable quantities rather than mere labels

Continuous data
- Ordered numerical data that can theoretically take on any value
- Data that represent measurable quantities but are not restricted to taking on certain specified values (such as integers)
- The only limiting factor for a continuous observation is the degree of accuracy with which it can be measured
  - e.g., serum cholesterol level of a patient, concentration of a pollutant, height, weight, age, temperature

2. Descriptive statistics/numerical summary measures

Measures of central tendency
- The most commonly investigated characteristic of a set of data is its center, or the point about which the observations tend to cluster

Mean
- Sum of all observations divided by n
- Pro: natural measure utilizing all the data
- Con: sensitive to extreme values

Median (m)
- Middle-most observation of ordered data
- Pro: insensitive to extreme values
- Con: determined mainly by middle points of sample
- Calculation
  1. Order data from smallest to largest
  2. If n is odd: m = the ((n+1)/2)th observation in the ordered data
  3. If n is even: m = average of the (n/2)th and (n/2 + 1)th observations

Mode
- Observation that occurs most frequently
- Pro: can be used with categorical data (e.g., most popular presidential candidate)
- Con: less useful with continuous data
- A data set may have no mode or more than one mode
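The three measures of central tendency can be sketched in a few lines of Python. The sample below is hypothetical (it is not from the lecture) and was chosen to include one extreme value so that the sensitivity of the mean is visible. The helper `median_by_rule` is an illustrative name that follows the odd/even rule from the median slide.

```python
import statistics

# Hypothetical sample (not from the lecture); 40 is a deliberate extreme value.
data = [2, 3, 3, 5, 7, 8, 40]

mean = statistics.mean(data)         # sum of all observations divided by n
median = statistics.median(data)     # middle-most observation of the ordered data
modes = statistics.multimode(data)   # list of every most-frequent value


def median_by_rule(values):
    """Median computed with the odd/even rule for ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the ((n+1)/2)th observation (1-based), so index (n+1)//2 - 1
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the (n/2)th and (n/2 + 1)th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(mean, median, modes, median_by_rule(data))
```

In this sample the single extreme value pulls the mean well above the median, while the median and mode are unaffected.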

Relationships
- Symmetric (unimodal) distribution: mean = median = mode
- Skewed distribution
  - to the right: mean > median
  - to the left: mean < median

3. Measures of dispersion/variability

Range
- Difference between the largest observation and the smallest
- Quick and dirty measure of variability
- Pro: easy to calculate
- Cons
  - Sensitive to extreme values
  - Tends to increase with increasing n

Interquartile range
- Difference between the 25th and the 75th percentiles (quartiles)
- Encompasses the middle 50% of observations
- Percentiles: the pth percentile is the value X(p) such that p percent of the data values are less than or equal to X(p)
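A minimal sketch of the range and the interquartile range, using a hypothetical sample. Several conventions exist for interpolating percentiles; the `method="inclusive"` option of `statistics.quantiles` is one of them and is an assumption here, so other software may report slightly different quartiles for the same data.

```python
import statistics

# Hypothetical ordered sample (not from the lecture).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Range: largest observation minus smallest.
data_range = max(data) - min(data)

# Quartiles: the 25th, 50th, and 75th percentiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

# Interquartile range: spans the middle 50% of the observations.
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```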

Variance
- Quantifies the amount of variability, or spread, around the mean of the measurements
- Calculated by measuring the average squared distance of the observations from the mean

Standard deviation
- Square root of the variance
- More widely reported than the variance since the units are the same as for the data

Standard error of the mean (SEM)
- Indicates how much the mean varies across repeated experiments measuring the same quantity
- If the effect of random variation is large, the SEM will be higher
- If the data points do not change as experiments are repeated, the SEM is zero
- SEM decreases as n increases (SEM = s.d./√n)

Coefficient of variation
- Standard deviation as a percentage of the mean
- Useful for comparing variability of different samples, each with different means
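The dispersion measures above build on one another, which a short calculation makes concrete. The data are hypothetical. One assumption to note: `statistics.variance` computes the sample variance, which divides the sum of squared deviations by n - 1 rather than n.

```python
import math
import statistics

# Hypothetical measurements (not from the lecture).
data = [4.0, 6.0, 8.0, 10.0, 12.0]

n = len(data)
mean = statistics.mean(data)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
variance = statistics.variance(data)

# Standard deviation: square root of the variance, in the same units as the data.
sd = math.sqrt(variance)

# Standard error of the mean: shrinks as n grows.
sem = sd / math.sqrt(n)

# Coefficient of variation: standard deviation as a percentage of the mean.
cv_percent = 100 * sd / mean

print(variance, sd, sem, cv_percent)
```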

4. Normal distribution and confidence intervals

Normal distribution
- Widely used continuous distribution (Gaussian distribution or bell-shaped curve)
- Mean = median = mode
- Standard normal distribution: mean = 0; s.d. = 1
- Central limit theorem: given certain conditions, the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed

Normal range
- Applies to normally distributed data
- 68% normal range = µ ± 1σ
- 95% normal range = µ ± 1.96σ
- 99% normal range = µ ± 2.58σ
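The multipliers 1.96 and 2.58 are quantiles of the standard normal distribution, and they can be recovered with Python's `statistics.NormalDist`. This is a sketch; `normal_range` is an illustrative helper name, and the mean and standard deviation in the usage line are hypothetical. The 1σ multiplier is a rounded value: µ ± 1σ covers about 68.3% of a normal distribution.

```python
from statistics import NormalDist


def normal_range(mu, sigma, coverage):
    """Symmetric interval around mu containing `coverage` of a normal population."""
    # Two-sided interval: leave (1 - coverage) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + coverage / 2)
    return (mu - z * sigma, mu + z * sigma)


# Multipliers for 95% and 99% coverage.
z95 = NormalDist().inv_cdf(0.975)
z99 = NormalDist().inv_cdf(0.995)
print(round(z95, 2), round(z99, 2))

# Hypothetical population: mean 100, standard deviation 15.
print(normal_range(100, 15, 0.95))
```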

Confidence interval
- Range that describes where the true population parameter is likely to be with a certain level of confidence
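As one concrete case, a confidence interval for a population mean can be sketched as follows. The sketch assumes the population standard deviation is known, so a normal quantile is used; with an unknown standard deviation and a small sample, a t quantile with n - 1 degrees of freedom would replace it. The function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """z-based confidence interval for a mean, assuming sigma is known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / math.sqrt(n)  # z times the standard error
    return (sample_mean - half_width, sample_mean + half_width)


# Hypothetical inputs: sample mean 50, sigma 10, n = 25.
lower, upper = mean_confidence_interval(50.0, 10.0, 25)
print(round(lower, 2), round(upper, 2))
```

Raising the confidence level widens the interval, and increasing n narrows it.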

5. Hypothesis testing

Procedure for hypothesis testing
- Hypothesis testing: an objective framework for making scientific conclusions based on a sample of data

Procedure for hypothesis testing: Step 1
- Ask a question about a population parameter
  - Is the mean CD4 count for HIV(+) patients less than 400?
  - Does smoking increase the risk of lung cancer?
  - Is there a difference in mean serum cholesterol levels between kids who eat oatmeal and kids who eat Frosted Cheerios?

Procedure for hypothesis testing: Step 2
- Translate the question into a hypothesis
- Null hypothesis (H0): no difference or no effect
  - Mean CD4 levels in HIV(+) patients = 400 (µ = 400)
- Alternative hypothesis (H1): hypothesis that contradicts the null hypothesis; usually the research hypothesis of interest
  - One-sided: used when interested in deviation from the null hypothesis in one direction
    - Mean CD4 levels in HIV(+) patients < 400 (µ < 400)
  - Two-sided: used when interested in any deviation from the null hypothesis
    - Mean CD4 levels in HIV(+) patients ≠ 400 (µ ≠ 400)

Procedure for hypothesis testing: Step 3
- Pick a significance level

                      Decision
  H0        Accept H0        Reject H0
  TRUE      No error         Type I error
  FALSE     Type II error    No error

- Type I error: incorrectly rejecting H0 when H0 is true
  - α: probability of Type I error; also called the significance level
- Type II error: incorrectly accepting H0 when H1 is true
  - β: probability of Type II error
- Power = 1 - β: probability of making the correct conclusion (rejecting H0) when H1 is true

Procedure for hypothesis testing: Steps 4-7
- Collect data
- Calculate the test statistic
  - Differs depending on the sampling design and the type of outcome variable
- Convert to a p-value
  - Probability of obtaining data at least as extreme as the observed data if H0 is true
- Make a decision about the data based on the p-value

Test statistics for inferences about a population mean
- Z-test: known variance; the distribution of the test statistic under H0 can be approximated by a normal distribution
- The p-value for this test is given by the probability of obtaining a z-value equal to or more extreme than the computed z
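A minimal sketch of a one-sample z-test, assuming the population standard deviation is known. The function name `z_test`, its `alternative` argument, and the numbers in the usage line are all hypothetical and chosen only for illustration.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """One-sample z-test for a mean with known sigma; returns (z, p-value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    lower_tail = NormalDist().cdf(z)  # P(Z <= z) under H0
    if alternative == "less":         # H1: mu < mu0
        p = lower_tail
    elif alternative == "greater":    # H1: mu > mu0
        p = 1 - lower_tail
    else:                             # H1: mu != mu0, both tails count as extreme
        p = 2 * min(lower_tail, 1 - lower_tail)
    return z, p


# Hypothetical data: sample mean 104 from n = 25, H0: mu = 100, sigma = 10.
z, p = z_test(104.0, 100.0, 10.0, 25)
print(round(z, 2), round(p, 4))
```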

Test statistics for inferences about a population mean
- t-test: unknown variance
- The p-value for this test is given by the probability of obtaining a t statistic with n-1 degrees of freedom equal to or more extreme than the computed t

Example of hypothesis testing
1. Is the mean CD4 level of HIV(+) patients less than 400, assuming that CD4 levels are normally distributed?
2. H0: µ = 400; H1: µ < 400
3. α = 0.05
4. Collect a random sample of 10 HIV(+) patients; mean CD4 level = 305.5; standard deviation = 100
5. t = (305.5 - 400)/(100/√10) = -2.99
6. 0.005 < p < 0.01
7. p < 0.05; therefore reject H0; the result is significant
8. Conclusion: These data show that the mean CD4 level of HIV(+) patients is statistically significantly less than 400 (p < 0.01).
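The arithmetic in the worked example can be checked with a short script. It uses the standard deviation of 100 that appears in the calculation of t. Because the Python standard library has no t distribution, the p-value is bracketed with one-sided critical values for 9 degrees of freedom taken from a standard t table; the dictionary `T_CRIT_9DF` is an illustrative name, not part of the lecture.

```python
import math

# Values from the CD4 example.
n = 10
sample_mean = 305.5
sample_sd = 100.0   # standard deviation used in the calculation of t
mu0 = 400.0         # mean under H0

# One-sample t statistic with n - 1 degrees of freedom.
t = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
df = n - 1

# One-sided critical values of t for 9 degrees of freedom (standard t table).
T_CRIT_9DF = {0.05: 1.833, 0.01: 2.821, 0.005: 3.250}

# H1 is mu < 400, so evidence against H0 lies in the lower tail.
reject_at_005 = -t > T_CRIT_9DF[0.05]
p_between_0005_and_001 = T_CRIT_9DF[0.01] < -t < T_CRIT_9DF[0.005]

print(round(t, 2), df, reject_at_005, p_between_0005_and_001)
```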

6. Correlation and regression analysis

Correlation
- Quantification of the degree to which two random variables are related, provided that the relationship is linear
- Advantages
  - Maintains continuity of data
  - Models one variable as a function of the other
- Disadvantages
  - Only measures linear relationships
  - Requires normality assumption for testing hypotheses
  - Only useful when both variables are continuous

Two-way scatter plot
- Possible values of X are placed on the horizontal axis
  - X is used to predict Y; X is the independent variable
- Possible values of Y are placed on the vertical axis
  - Y is the dependent variable
- Example plot: percentages of births attended by trained health care personnel and maternal mortality rate for 20 countries

Population correlation coefficient (ρ)
- Purpose of correlation analysis is to determine whether two continuous variables (X and Y) are linearly related
- Correlation coefficient
  - Measures linear relationship between X and Y
  - Ranges between -1 (perfect negative correlation) and 1 (perfect positive correlation)
  - When ρ = 0, X and Y are linearly unrelated
  - Strong correlation does not necessarily imply causation
- Pearson correlation coefficient (r) is an estimate of ρ based on a sample of data; both X and Y are assumed to be normally distributed
- Spearman correlation coefficient (rs) is the non-parametric analog to the Pearson correlation; no assumptions are necessary about the distributions of X and Y
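The difference between the Pearson and Spearman coefficients can be illustrated with a small, self-contained sketch. The function names and the data are hypothetical. Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values sharing the average of their ranks.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rs(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: Y increases with X, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson_r(x, y), 3), round(spearman_rs(x, y), 3))
```

Because the relationship is monotonic but curved, the Spearman coefficient reaches 1 while the Pearson coefficient stays below 1.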

Simple linear regression
- Purpose is to model the change in Y as X changes
- Examples of uses
  - Prediction (what is the predicted amount of time it will take you to get home from work given the time that you leave?)
  - Linear association (is there a linear relationship between CD4 levels and time since infection with HIV?)
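A least-squares fit of a straight line can be sketched as follows, using the commute-time question as a running illustration. The function name `fit_line` and all data values are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


# Hypothetical data: minutes after 4 pm that you leave work (X)
# and minutes taken to get home (Y).
leave_time = [0, 15, 30, 45, 60]
commute = [30, 35, 45, 50, 60]

intercept, slope = fit_line(leave_time, commute)

# Prediction for leaving 20 minutes after 4 pm.
predicted = intercept + slope * 20
print(intercept, slope, predicted)
```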

7. Analysis of variance (ANOVA)

ANOVA
- Used to model the means of one variable (Y) for the various levels of other variables
- Extension of the two-sample t-test to three or more samples
  - The number of pairwise t-tests increases rapidly with the number of groups (k groups require k(k-1)/2 tests); analysis becomes cognitively difficult; ANOVA organizes and directs the analysis
  - Conducting a greater number of analyses greatly increases the probability of committing at least one Type I error somewhere in the analysis
  - Performing fewer hypothesis tests reduces the experiment-wise error rate
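The cost of running many t-tests instead of a single ANOVA can be made concrete with a rough calculation. The sketch below counts the pairwise comparisons among k groups and estimates the probability of at least one Type I error, under the simplifying assumption that the tests are independent. Pairwise tests that share groups are not truly independent, so the figures are only approximate. Both function names are illustrative.

```python
from math import comb


def pairwise_tests(k):
    """Number of two-sample comparisons among k groups."""
    return comb(k, 2)


def familywise_error(m, alpha=0.05):
    """P(at least one Type I error) in m tests at level alpha, assuming independence."""
    return 1 - (1 - alpha) ** m


for k in (3, 5, 10):
    m = pairwise_tests(k)
    print(k, m, round(familywise_error(m), 3))
```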

Completely randomized design; One-way ANOVA
- One-way implies that there is a single factor or characteristic that distinguishes the various populations from each other
- Applicable when the outcome variable (Y) is continuous, normally distributed, and has approximately equal variance in all treatment groups
- Notation
  - Let Y be a continuous variable under investigation in k populations
  - Let µ1, ..., µk be the true means in each of the k populations
  - Let n1, ..., nk be the number of subjects from each population

Completely randomized design; One-way ANOVA
- Hypotheses
  - H0: µ1 = µ2 = ... = µk
  - H1: µv ≠ µw for some v ≠ w (do not need to specify which means differ)
- Data layout
  - Total sample size (n)
  - Grand total (T)
  - Grand mean (ȳ)
- Data presentation
  - Tables of means and standard deviations for each group, along with sample sizes
- Test statistic
  - F-test arising from an ANOVA table; yields 2-sided p-values

Generating an ANOVA table (F-statistic)
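A minimal sketch of how the entries of a one-way ANOVA table can be computed from raw data, using the usual decomposition of the total sum of squares into between-group and within-group parts. The function name `one_way_anova` and the three small groups are hypothetical. The p-value is omitted because it requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova(groups):
    """Sums of squares, degrees of freedom, mean squares, and F for k groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between groups: spread of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within groups: spread of the observations around their own group mean.
    ss_within = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    )

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "df_between": df_between, "ms_between": ms_between,
        "ss_within": ss_within, "df_within": df_within, "ms_within": ms_within,
        "ss_total": ss_between + ss_within, "df_total": n_total - 1,
        "F": ms_between / ms_within,
    }


# Hypothetical data for three groups.
table = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(table)
```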

One-way ANOVA example
- Study investigating the effects of carbon monoxide exposure on individuals with coronary artery disease
- Patients (men) subjected to a series of exercise tests; men recruited from 3 medical centers
- Before combining subjects into one large group to conduct the analysis, need to examine baseline characteristics to ensure that patients from the different centers were comparable
- Characteristic to test: FEV1 (forced expiratory volume in 1 sec)
- ANOVA table

  Source of Variation   SS         df   MS         F          P-value
  Between Groups        1.394841    2   0.69742    2.730028   0.073604
  Within Groups         14.81684   58   0.255463
  Total                 16.21168   60
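The entries of the FEV1 ANOVA table are linked by simple arithmetic, which the following sketch reproduces from the sums of squares and degrees of freedom in the table. The p-value is copied from the table rather than recomputed, and the 0.05 threshold is an assumed significance level.

```python
# Sums of squares and degrees of freedom from the FEV1 ANOVA table.
ss_between, df_between = 1.394841, 2
ss_within, df_within = 14.81684, 58

# Mean square = sum of squares / degrees of freedom.
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F statistic = between-group mean square / within-group mean square.
f_stat = ms_between / ms_within

# Number of groups and total sample size implied by the degrees of freedom.
k = df_between + 1
n_total = df_between + df_within + 1

# P-value as reported in the table, compared with an assumed alpha of 0.05.
p_value = 0.073604
reject_h0 = p_value < 0.05

print(round(ms_between, 5), round(ms_within, 6), round(f_stat, 4), k, n_total, reject_h0)
```

With p above 0.05, H0 is not rejected at that level, so these data give no significant evidence that mean FEV1 differs among the three centers.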

8. Experimental design

Sample size determination
- When designing a study with the goal of testing a hypothesis, we need to know how many subjects to study
- Five variables must be specified
  1. α: level of significance
  2. One- or two-sided form of the alternative hypothesis
  3. δ: desired difference to detect
  4. Power: 1 - β (probability of detecting a difference of δ; power increases with increasing sample size)
  5. σD: standard deviation of the paired differences (typically estimated using published or pilot data)
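The five quantities listed above combine into a sample size through a standard normal-approximation formula for paired data, n = ((z_α + z_β) σD / δ)², where z_α uses α/2 for a two-sided alternative. The sketch below implements that approximation; it is an assumption that this is the formula intended here, and a t-based calculation would give a slightly larger n. The function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist


def paired_sample_size(alpha, power, delta, sigma_d, two_sided=True):
    """Approximate number of pairs needed to detect a mean difference delta."""
    tail = alpha / 2 if two_sided else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)      # quantile matching power = 1 - beta
    n = ((z_alpha + z_beta) * sigma_d / delta) ** 2
    return math.ceil(n)                       # round up to whole subjects


# Hypothetical inputs: alpha = 0.05, power = 0.80, delta = 5, sigma_D = 10.
n_two_sided = paired_sample_size(0.05, 0.80, 5.0, 10.0, two_sided=True)
n_one_sided = paired_sample_size(0.05, 0.80, 5.0, 10.0, two_sided=False)
print(n_two_sided, n_one_sided)
```

Higher power, a smaller δ, or a larger σD each increase the required sample size.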

Basic study designs (listed in order of increasing stringency)
1. Cross-sectional study: observation of a population, or a representative subset, at one specific point in time; descriptive study (not longitudinal or experimental)
2. Cohort (prospective/observational) study: identify cohort; measure exposure; follow for prolonged period of time; determine who develops disease; analyze to determine whether disease is associated with exposure
3. Case-control (retrospective) study: identify set of patients with disease and corresponding set of controls without disease; find out retrospectively about exposure; analyze data to determine whether associations exist