basic biostatistics ME Mass spectrometry in an omics world December 10, 2012 Stefani Thomas, Ph.D.


Lecture 13. Clinical studies and basic biostatistics
ME330.884 Mass spectrometry in an omics world
December 10, 2012
Stefani Thomas, Ph.D.

Statistics and biostatistics
- Statistics: collection, organization, analysis, and interpretation of numerical data
- Objective: make an inference about a population based on information contained in a sample
- Biostatistics: application of statistical methods to medical and biological problems

Role of statistics in decision-making processes
- Analysis of data from clinical trials to determine efficacy of new drugs
- Should a mastectomy always be recommended to a patient with breast cancer?
- What factors increase the risk that an individual will develop coronary heart disease?

Numbers are more precise than words
- "There are three kinds of lies: lies, damned lies, and statistics." Benjamin Disraeli (British Prime Minister, 1874-1880)
- "It is easy to lie with statistics, but it is easier to lie without them." Professor Frederick Mosteller (founding chairman of Harvard's statistics department, 1956)

Outline
1. Types of data (variables)
2. Descriptive statistics/numerical summary measures
3. Measures of dispersion/variability
4. Normal distribution and confidence intervals
5. Hypothesis testing
6. Correlation and regression analysis
7. Analysis of variance (ANOVA)
8. Experimental design

1. Types of data (variables)

Categorical data
- Nominal data: categories without a natural order, e.g., sex, race, country
- Ordinal data: categories with a natural order, e.g., socioeconomic status (low, middle, high); type of bone break (hairline, simple, compound)
- Numbers can be assigned to specific values, but the value of the numbers is arbitrary
- Percentages and proportions are used to analyze categorical data

Discrete data
- Ordered numerical data restricted to integer values
- e.g., number of deaths due to AIDS in 2011; eggs laid per chicken; number of new cases of tuberculosis reported in the U.S. during a one-year period
- Both ordering and magnitude are important
- Numbers represent actual measurable quantities rather than mere labels

Continuous data
- Ordered numerical data that can theoretically take on any value
- Data that represent measurable quantities but are not restricted to taking on certain specified values (such as integers)
- The only limiting factor for a continuous observation is the degree of accuracy with which it can be measured
- e.g., serum cholesterol level of a patient, concentration of a pollutant, height, weight, age, temperature

2. Descriptive statistics/numerical summary measures

Measures of central tendency
- The most commonly investigated characteristic of a set of data is its center, or the point about which the observations tend to cluster

Mean
- Sum of all observations divided by n
- Pro: natural measure utilizing all the data
- Con: sensitive to extreme values

Median (m)
- Middle-most observation of ordered data
- Pro: insensitive to extreme values
- Con: determined mainly by middle points of sample
- Calculation:
  1. Order data from smallest to largest
  2. If n is odd: m = the ((n+1)/2)th observation in the ordered data
  3. If n is even: m = average of the (n/2)th and ((n/2)+1)th observations
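The calculation steps above can be sketched in a few lines of Python. This is only an illustrative sketch; the function name `median` and the sample data are assumptions, not part of the lecture. It also shows why the median is insensitive to extreme values while the mean is not.

```python
def median(values):
    """Median following the slide's rule (1-based positions in sorted data)."""
    ordered = sorted(values)              # step 1: order smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:                        # step 2: n odd
        return ordered[(n + 1) // 2 - 1]  # the ((n+1)/2)th observation
    # step 3: n even, average the (n/2)th and (n/2 + 1)th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def mean(values):
    """Sum of all observations divided by n."""
    return sum(values) / len(values)


# Hypothetical data: one extreme value shifts the mean but not the median.
typical = [1, 2, 3, 4]
with_outlier = [1, 2, 3, 1000]
median_typical = median(typical)
median_outlier = median(with_outlier)
mean_typical = mean(typical)
mean_outlier = mean(with_outlier)
```

In this example the median is the same for both lists, while the mean of the second list is pulled far toward the extreme value.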

Mode
- Observation that occurs most frequently
- Pro: can be used with categorical data (e.g., most popular presidential candidate)
- Con: less useful with continuous data
- A data set may have no mode or more than one mode
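A small sketch of finding the mode(s), including the "no mode" and "more than one mode" cases. The helper name `modes` is hypothetical, and treating a data set in which no value repeats as having no mode is a convention assumed here, not something the slides specify.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values; an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        # Assumed convention: if nothing repeats, report no mode.
        return []
    return sorted(v for v, c in counts.items() if c == top)


# Works for categorical data as well as numbers (hypothetical examples).
single_mode = modes(["A", "B", "B", "C"])
two_modes = modes([1, 1, 2, 2, 3])
no_mode = modes([1.2, 3.4, 5.6])
```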

Relationships
- Symmetric distribution: mean = median = mode
- Distribution skewed to the right: mean > median
- Distribution skewed to the left: mean < median

3. Measures of dispersion/variability

Range
- Difference between the largest observation and the smallest
- Quick and dirty measure of variability
- Pro: easy to calculate
- Cons: sensitive to extreme values; tends to increase with increasing n

Interquartile range
- Difference between the 25th and 75th percentiles (the first and third quartiles)
- Encompasses the middle 50% of observations
- Percentiles: the pth percentile is the value X(p) such that p percent of the data values are less than or equal to X(p)
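The percentile definition above can be turned into a minimal sketch. The function names are hypothetical, and the rule used here (the smallest data value with at least p percent of observations at or below it) is one of several conventions; statistical packages often interpolate between observations and may give slightly different quartiles.

```python
import math


def percentile(values, p):
    """Smallest data value X such that at least p percent of values are <= X."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of empty data is undefined")
    rank = max(1, math.ceil(p * n / 100))  # 1-based position in sorted data
    return ordered[rank - 1]


def interquartile_range(values):
    """Difference between the 75th and 25th percentiles."""
    return percentile(values, 75) - percentile(values, 25)


data = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical observations
q1 = percentile(data, 25)
q3 = percentile(data, 75)
iqr = interquartile_range(data)
```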

Variance
- Quantifies the amount of variability, or spread, around the mean of the measurements
- Calculated by measuring the average squared distance of the observations from the mean

Standard deviation
- Square root of the variance
- More widely reported than the variance since the units are the same as for the data
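As a sketch, variance and standard deviation can be computed directly from the definition. One detail the slides do not spell out, and which is an assumption here: for a sample, the sum of squared deviations is usually divided by n - 1 rather than n, so both versions are shown. The data are hypothetical.

```python
import math


def variance(values, sample=True):
    """Average squared distance from the mean.

    sample=True divides by n - 1 (sample variance);
    sample=False divides by n (population variance).
    """
    n = len(values)
    if n < 2 and sample:
        raise ValueError("sample variance needs at least two observations")
    center = sum(values) / n
    squared_distances = sum((x - center) ** 2 for x in values)
    return squared_distances / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Square root of the variance; same units as the data."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical observations, mean = 5
pop_var = variance(data, sample=False)    # 32 / 8
pop_sd = standard_deviation(data, sample=False)
samp_var = variance(data)                 # 32 / 7
```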

Standard error of the mean (SEM)
- Indication of how the mean varies across different experiments measuring the same quantity
- If the effect of random variation is large, the SEM will be higher
- If the data points do not change as experiments are repeated, the SEM is zero
- SEM decreases as n increases
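The slide does not state the formula, but the SEM is conventionally computed as the standard deviation divided by the square root of n. A minimal sketch (function name and numbers are hypothetical) showing that it shrinks as n grows:

```python
import math


def standard_error_of_mean(sd, n):
    """SEM = standard deviation / sqrt(n)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sd / math.sqrt(n)


# Same standard deviation, increasing sample size: the SEM decreases.
sem_n25 = standard_error_of_mean(10, 25)
sem_n100 = standard_error_of_mean(10, 100)
# No variability at all gives an SEM of zero.
sem_no_spread = standard_error_of_mean(0, 25)
```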

Coefficient of variation
- Standard deviation as a percentage of the mean
- Useful for comparing variability of different samples, each with different means
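A short sketch of the coefficient of variation, using made-up summary values for two measurements on different scales (the numbers and names are illustrative assumptions):

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean


# Hypothetical samples with different means and units.
cv_height = coefficient_of_variation(sd=8.5, mean=170)   # cm
cv_weight = coefficient_of_variation(sd=7.0, mean=70)    # kg
more_variable = "weight" if cv_weight > cv_height else "height"
```

Because the CV is unitless, the two samples can be compared directly even though their means and units differ.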

4. Normal distribution and confidence intervals

Normal distribution
- Widely used continuous distribution (Gaussian distribution or bell-shaped curve)
- Mean = median = mode
- Standard normal distribution: mean = 0; s.d. = 1
- Central limit theorem: given certain conditions, the mean of a sufficiently large number of independent random variables, each with finite mean and variance, will be approximately normally distributed

Normal range
- Applies to normally distributed data
- 68% normal range = µ ± 1σ
- 95% normal range = µ ± 1.96σ
- 99% normal range = µ ± 2.58σ
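The multipliers 1, 1.96 and 2.58 can be checked against the standard normal distribution. The sketch below uses Python's standard-library `statistics.NormalDist`; the example mean and standard deviation are hypothetical.

```python
from statistics import NormalDist


def normal_range(mu, sigma, z):
    """Interval mu - z*sigma to mu + z*sigma."""
    return (mu - z * sigma, mu + z * sigma)


def coverage(z):
    """Fraction of a normal distribution lying within z standard deviations of the mean."""
    standard_normal = NormalDist()   # mean 0, s.d. 1
    return standard_normal.cdf(z) - standard_normal.cdf(-z)


coverage_by_z = {z: coverage(z) for z in (1.0, 1.96, 2.58)}

# Hypothetical measurement with mean 100 and s.d. 15.
low, high = normal_range(mu=100, sigma=15, z=1.96)
```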

Confidence interval
- Range that describes where the true population parameter is likely to be with a certain level of confidence
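As an illustration, a confidence interval for a population mean can be sketched as the sample mean plus or minus a multiple of the standard error. This sketch assumes the population standard deviation is known, so the normal distribution applies; with an estimated standard deviation and a small sample, a t-based multiplier would be used instead. All names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean, assuming a known population s.d. (sigma)."""
    # Two-sided multiplier, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return (sample_mean - margin, sample_mean + margin)


ci_95 = z_confidence_interval(sample_mean=100, sigma=15, n=36)
ci_99 = z_confidence_interval(sample_mean=100, sigma=15, n=36, confidence=0.99)
```

A higher confidence level gives a wider interval for the same data.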

5. Hypothesis testing

Procedure for hypothesis testing
- Hypothesis testing: an objective framework for making scientific conclusions based on a sample of data

Procedure for hypothesis testing: Step 1
- Ask a question about a population parameter
- Is the mean CD4 count for HIV(+) patients less than 400?
- Does smoking increase the risk of lung cancer?
- Is there a difference in mean serum cholesterol levels between kids who eat oatmeal and kids who eat Frosted Cheerios?

Procedure for hypothesis testing: Step 2
- Translate the question into a hypothesis
- Null hypothesis (H0): no difference or no effect
  - Mean CD4 levels in HIV(+) patients = 400 (µ = 400)
- Alternative hypothesis (H1): hypothesis that contradicts the null hypothesis; usually the research hypothesis of interest
  - One-sided: used when interested in deviation from the null hypothesis in one direction, e.g., mean CD4 levels in HIV(+) patients < 400 (µ < 400)
  - Two-sided: used when interested in any deviation from the null hypothesis, e.g., mean CD4 levels in HIV(+) patients ≠ 400 (µ ≠ 400)

Procedure for hypothesis testing: Step 3
- Pick a significance level

                    Decision
  H0        Accept H0        Reject H0
  TRUE      No error         Type I error
  FALSE     Type II error    No error

- Type I error: incorrectly rejecting H0 when H0 is true
  - α: probability of a Type I error; also called the significance level
- Type II error: incorrectly accepting H0 when H1 is true
  - β: probability of a Type II error
- Power = 1 − β: probability of making the correct conclusion (rejecting H0) when H1 is true

Procedure for hypothesis testing: Steps 4-7
- Collect data
- Calculate the test statistic
  - Differs depending on the sampling design and the type of outcome variable
- Convert to a p-value
  - Probability of obtaining data at least as extreme as those observed if H0 is true
- Make a decision about the data based on the p-value

Test statistics for inferences about a population mean
- Z-test: known variance; the distribution of the test statistic under H0 can be approximated by a normal distribution
- The p-value for this test is given by the probability of obtaining a z-value equal to or more extreme than the computed z
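A minimal sketch of a one-sample z-test, assuming the usual statistic z = (sample mean − µ0) / (σ/√n), which the slide does not write out. The function name, arguments, and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alternative="two-sided"):
    """One-sample z-test with known population s.d.; returns (z, p_value)."""
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    standard_normal = NormalDist()
    if alternative == "less":        # H1: mu < mu0
        p = standard_normal.cdf(z)
    elif alternative == "greater":   # H1: mu > mu0
        p = 1 - standard_normal.cdf(z)
    else:                            # H1: mu != mu0
        p = 2 * (1 - standard_normal.cdf(abs(z)))
    return z, p


# Hypothetical data: the sample mean lies 1.96 standard errors below mu0.
z_value, p_two_sided = z_test(98.04, 100, sigma=10, n=100)
_, p_one_sided = z_test(98.04, 100, sigma=10, n=100, alternative="less")
```

The one-sided p-value is half the two-sided one here because the observed deviation is in the direction of the alternative.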

Test statistics for inferences about a population mean
- t-test: unknown variance
- The p-value for this test is given by the probability of obtaining a t statistic with n − 1 degrees of freedom equal to or more extreme than the computed t

Example of hypothesis testing
1. Is the mean CD4 level of HIV(+) patients less than 400, assuming that CD4 levels are normally distributed?
2. H0: µ = 400; H1: µ < 400
3. α = 0.05
4. Collect a random sample of 10 HIV(+) patients; mean CD4 level = 305.5; standard deviation = 100
5. t = (305.5 − 400)/(100/√10) = −2.99
6. 0.005 < p < 0.01
7. p < 0.05; therefore reject H0; the result is significant
8. Conclusion: these data show that the mean CD4 level of HIV(+) patients is statistically significantly less than 400 (p < 0.01).
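The arithmetic in this example can be reproduced with a short sketch. It assumes a sample standard deviation of 100, the value that appears in the computed t. Python's standard library has no t distribution, so the one-tailed critical values for 9 degrees of freedom are hard-coded from a standard t table; they are an assumption of this sketch rather than something computed here.

```python
import math


def one_sample_t(sample_mean, mu0, sd, n):
    """t statistic for a one-sample t-test: (mean - mu0) / (sd / sqrt(n))."""
    return (sample_mean - mu0) / (sd / math.sqrt(n))


t_stat = one_sample_t(sample_mean=305.5, mu0=400, sd=100, n=10)
degrees_of_freedom = 10 - 1

# One-tailed critical values of t with 9 degrees of freedom (from a t table).
T_CRITICAL_9DF = {0.05: 1.833, 0.01: 2.821, 0.005: 3.250}

# H1 is mu < 400, so evidence against H0 is a large negative t.
reject_at_alpha_05 = t_stat < -T_CRITICAL_9DF[0.05]
p_between_005_and_01 = T_CRITICAL_9DF[0.01] < abs(t_stat) < T_CRITICAL_9DF[0.005]
```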

6. Correlation and regression analysis

Correlation
- Quantification of the degree to which two random variables are related, provided that the relationship is linear
- Advantages:
  - Maintains continuity of data
  - Models one variable as a function of the other
- Disadvantages:
  - Only measures linear relationships
  - Requires normality assumption for testing hypotheses
  - Only useful when both variables are continuous

Two-way scatter plot
- Possible values of X are placed on the horizontal axis
  - X is used to predict Y; X is the independent variable
- Possible values of Y are placed on the vertical axis
  - Y is the dependent variable
- [Figure: percentages of births attended by trained health care personnel and maternal mortality rate for 20 countries]

Population correlation coefficient (ρ)
- The purpose of correlation analysis is to determine whether two continuous variables (X and Y) are linearly related
- Correlation coefficient:
  - Measures the linear relationship between X and Y
  - Ranges between −1 (perfect negative correlation) and 1 (perfect positive correlation)
  - When ρ = 0, X and Y are linearly unrelated
- Strong correlation does not necessarily imply causation
- The Pearson correlation coefficient (r) is an estimate of ρ based on a sample of data; both X and Y are assumed to be normally distributed
- The Spearman correlation coefficient (r_s) is the non-parametric analog of the Pearson correlation; no assumptions are necessary about the distributions of X and Y
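A sketch contrasting the two coefficients on hypothetical data where Y increases with X but not linearly. The Pearson formula used is the standard sample one; the Spearman coefficient is computed as the Pearson correlation of the ranks. For simplicity this sketch assumes there are no tied values.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_r(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))


# Hypothetical data: y = x squared, monotonic but not linear.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r = pearson_r(x, y)
r_s = spearman_r(x, y)
```

Here the rank-based coefficient reaches 1 because the relationship is perfectly monotonic, while the Pearson coefficient is slightly below 1 because the relationship is not a straight line.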

Simple linear regression
- Purpose is to model the change in Y as X changes
- Examples of uses:
  - Prediction (what is the predicted amount of time it will take you to get home from work given the time that you leave?)
  - Linear association (is there a linear relationship between CD4 levels and time since infection with HIV?)
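A minimal least-squares sketch for the model Y = intercept + slope × X. The slides do not give the fitting formulas; the ones below are the standard ordinary least squares estimates, and the data and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares estimates (intercept, slope) for y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, new_x):
    """Predicted Y for a new value of X."""
    return intercept + slope * new_x


# Hypothetical data lying exactly on the line y = 1 + 2x.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]
intercept, slope = fit_line(x, y)
predicted = predict(intercept, slope, 10)
```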

7. Analysis of variance (ANOVA)

ANOVA
- Used to model the means of one variable (Y) for the various levels of other variables
- Extension of the two-sample t-test to three or more samples
- The number of pairwise t-tests grows rapidly with the number of groups (k(k−1)/2 comparisons for k groups); analysis becomes cognitively difficult; ANOVA organizes and directs the analysis
- Conducting a greater number of analyses greatly increases the probability of committing at least one Type I error somewhere in the analysis
- Performing fewer hypothesis tests reduces the experiment-wise error rate
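To illustrate why many pairwise t-tests are a problem, the sketch below counts the pairwise comparisons among k groups and computes the chance of at least one Type I error. The error-rate formula 1 − (1 − α)^m assumes the m tests are independent, which pairwise tests on shared groups are not exactly, so treat the result as a rough guide.

```python
import math


def pairwise_comparisons(k):
    """Number of distinct pairs among k groups: k(k-1)/2."""
    return math.comb(k, 2)


def familywise_error_rate(alpha, m):
    """P(at least one Type I error) across m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


tests_for_5_groups = pairwise_comparisons(5)
error_rate_5_groups = familywise_error_rate(0.05, tests_for_5_groups)
error_rate_single_test = familywise_error_rate(0.05, 1)
```

With five groups there are ten pairwise tests, and the chance of at least one false positive is far above the nominal 5% of a single test.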

Completely randomized design; One-way ANOVA
- One-way implies that there is a single factor or characteristic that distinguishes the various populations from each other
- Applicable when the outcome variable (Y) is continuous, normally distributed, and has approximately equal variance in all treatment groups
- Notation:
  - Let Y be a continuous variable under investigation in k populations
  - Let µ_1, ..., µ_k be the true means in each of the k populations
  - Let n_1, ..., n_k be the number of subjects from each population

Completely randomized design; One-way ANOVA
- Hypotheses:
  - H0: µ_1 = µ_2 = ... = µ_k
  - H1: µ_v ≠ µ_w for some v ≠ w (do not need to specify which means differ)
- Data layout: total sample size (n), grand total (T), grand mean (ȳ)
- Data presentation: tables of means and standard deviations for each group, along with sample sizes
- Test statistic: the F-test arising from an ANOVA table yields 2-sided p-values

Generating an ANOVA table (F-statistic)

One-way ANOVA example
- Study investigating the effects of carbon monoxide exposure on individuals with coronary artery disease
- Patients (men) subjected to a series of exercise tests; men recruited from 3 medical centers
- Before combining subjects into one large group to conduct the analysis, need to examine baseline characteristics to ensure that patients from the different centers were comparable
- Characteristic to test: FEV1 (forced expiratory volume in 1 sec)

ANOVA table

  Source of Variation    SS          df    MS          F           P-value
  Between Groups         1.394841     2    0.69742     2.730028    0.073604
  Within Groups          14.81684    58    0.255463
  Total                  16.21168    60
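The mean squares and F statistic in this table follow from the sums of squares and degrees of freedom, as the sketch below shows. The p-value function relies on a closed form for the F distribution that holds only when the numerator degrees of freedom equal 2 (as here, with 3 centers); for other designs a statistics package would be needed. Function names are hypothetical.

```python
def anova_from_sums_of_squares(ss_between, df_between, ss_within, df_within):
    """Mean squares and F statistic for a one-way ANOVA table."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
        "ss_total": ss_between + ss_within,
        "df_total": df_between + df_within,
    }


def f_p_value_two_numerator_df(f_stat, df_within):
    """P(F > f_stat) for an F distribution with (2, df_within) degrees of freedom.

    Closed form valid only when the numerator degrees of freedom are exactly 2.
    """
    return (1 + 2 * f_stat / df_within) ** (-df_within / 2)


# Values taken from the FEV1 ANOVA table above.
table = anova_from_sums_of_squares(1.394841, 2, 14.81684, 58)
p_value = f_p_value_two_numerator_df(table["f"], 58)
centers_differ_at_05 = p_value < 0.05
```

With p above 0.05, these data give no significant evidence that mean FEV1 differs between the three centers at the 5% level.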

8. Experimental design

Sample size determination
- When designing a study with the goal of testing a hypothesis, we need to know how many subjects to study
- Five variables must be specified:
  1. α: level of significance
  2. One- or two-sided form of the alternative hypothesis
  3. δ: desired difference to detect
  4. Power: 1 − β (probability of detecting a difference of δ; power increases with increasing sample size)
  5. σ_D: standard deviation of the paired differences (typically estimated using published or pilot data)
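The slides list the inputs but not a formula. One common normal-approximation formula for a paired design is n = ((z_α + z_β) · σ_D / δ)², where z_α is the critical value for the chosen significance level (using α/2 for a two-sided test) and z_β corresponds to the desired power. The sketch below assumes that formula; it is not necessarily the one used in the lecture, and a t-based calculation would give a slightly larger n.

```python
import math
from statistics import NormalDist


def paired_sample_size(alpha, power, delta, sigma_d, two_sided=True):
    """Approximate number of pairs needed to detect a mean difference delta.

    Normal-approximation formula: n = ((z_alpha + z_beta) * sigma_d / delta)^2,
    rounded up to the next whole subject.
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2) if two_sided else z(1 - alpha)
    z_beta = z(power)   # power = 1 - beta
    return math.ceil(((z_alpha + z_beta) * sigma_d / delta) ** 2)


# Hypothetical design: detect a difference of 5 units when sigma_D = 10.
n_two_sided = paired_sample_size(alpha=0.05, power=0.80, delta=5, sigma_d=10)
n_one_sided = paired_sample_size(alpha=0.05, power=0.80, delta=5, sigma_d=10,
                                 two_sided=False)
n_higher_power = paired_sample_size(alpha=0.05, power=0.90, delta=5, sigma_d=10)
```

Asking for more power, or for a two-sided rather than one-sided test, increases the required sample size.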

Basic study designs (listed in order of increasing stringency)
1. Cross-sectional study: observation of a population, or a representative subset, at one specific point in time; descriptive study (not longitudinal or experimental)
2. Cohort (prospective/observational) study: identify cohort; measure exposure; follow for a prolonged period of time; determine who develops disease; analyze to determine whether disease is associated with exposure
3. Case-control (retrospective) study: identify a set of patients with disease and a corresponding set of controls without disease; find out retrospectively about exposure; analyze data to determine whether associations exist