Descriptive statistical methods and comparison measures


Department of Epidemiology and Public Health
Unit of Biostatistics and Computational Sciences

PD Dr. C. Schindler
Swiss Tropical and Public Health Institute, University of Basel
christian.schindler@unibas.ch

Annual meeting of the Swiss Societies of Neurophysiology, Neurology and Stroke, Lucerne, May 19th, 2011

Contents
- Tabular representations
- Graphical representations
- Comparison measures for quantitative variables (difference in means, geometric mean ratio)
- Comparison measures for binary variables (risk difference, relative risk, odds ratio)
- Comparison measures for count data (incidence rate ratio)
- Non-parametric comparison measures (AUC)

General rules for tabular and graphical representations
Tables and figures should be self-explanatory.
- Tables and figures: title
- Tables and figures: caption
- Figures: clear axis titles with indication of units
- Figures: explanation of the different graphical elements (colors, symbols, line types, etc.)

Tabular representations

Table 1 (longitudinal study report)
Comparison of the different groups with respect to baseline characteristics (sex, age, etc., including the baseline value of the outcome variable).
- Qualitative variables: relative frequencies in % plus absolute frequencies
- Quantitative variables: mean (standard deviation) [1], or median (minimum, maximum) [2] or median (lower quartile, upper quartile) [2]
[1] if the QQ-plot does not deviate systematically from a straight line
[2] if the QQ-plot shows clear curvature or a wave pattern

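Statistical properties of the normal distribution
µ = mean, σ = standard deviation
- About 2/3 of all values (in fact: 68%) lie between µ - σ and µ + σ.
- About 95% of all values (in fact: 95.4%) lie between µ - 2σ and µ + 2σ, with about 2.5% in each tail.
[Figure: two normal density curves with the x-axis marked at µ - 2σ, µ - σ, µ, µ + σ and µ + 2σ.]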
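The percentages quoted for the normal distribution can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `normal_coverage` is introduced here for illustration and is not part of the slides.

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normally distributed value lies within
    k standard deviations of its mean (uses the error function)."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2):
    print(f"within mean +/- {k} SD: {normal_coverage(k):.1%}")
```

Calling `normal_coverage(1.96)` gives very nearly 95%, which is why the factor 1.96 appears in the confidence-interval formulas later in the talk.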

Huang HY et al., The Effects of Vitamin C Supplementation on Serum Concentrations of Uric Acid: Results of a Randomized Controlled Trial, Arthritis & Rheumatism, Vol. 52, No. 6, June 2005, pp. 1843-1847, DOI 10.1002/art.21105

Table 1 (cross-sectional study report)
Description of the sample studied and comparison with persons not included in the sample (with respect to demographic characteristics and health-relevant variables).
Same rules as for Table 1 of a longitudinal study report.

Alkerwi et al., Comparison of participants and non-participants to the ORISCAV-LUX population-based study on cardiovascular risk factors in Luxembourg, BMC Medical Research Methodology 2010, http://www.biomedcentral.com/content/pdf/1471-2288-10-80.pdf

Graphical representations

Boxplot (box plot)
Graphical representation of the distribution of a quantitative variable based on a few important measures (minimum, lower quartile, median, upper quartile, maximum). Outlying values are represented as individual points.

[Figure: boxplots of body mass index (y-axis, 0 to 50) by sex (men, women). BMI in adults aged 30 to 70 years in Basel (SAPALDIA study). Labelled elements, from top to bottom: upper fence*, 3rd quartile (75th percentile), median, 1st quartile (25th percentile), lower fence*.]
* Lower (upper) fence: smallest (largest) observation which is still within 1.5 box lengths of the lower (upper) end of the box.
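The elements of a boxplot, including the fences as defined in the footnote, can be computed as follows. This is a minimal sketch with hypothetical BMI values; the function name `boxplot_stats` is an assumption made here, and the quartiles use the "inclusive" interpolation of Python's `statistics.quantiles`, which is only one of several common quartile conventions.

```python
import statistics


def boxplot_stats(values):
    """Quartiles, fences and outliers as drawn in a standard boxplot."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    box_length = q3 - q1
    low_limit = q1 - 1.5 * box_length
    high_limit = q3 + 1.5 * box_length
    # Fences are the most extreme observations still inside the limits.
    inside = [v for v in data if low_limit <= v <= high_limit]
    outliers = [v for v in data if v < low_limit or v > high_limit]
    return {
        "lower_fence": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_fence": inside[-1],
        "outliers": outliers,
    }


# Hypothetical BMI values, for illustration only.
bmi = [18, 20, 21, 22, 23, 24, 25, 26, 27, 45]
print(boxplot_stats(bmi))
```

In this example the value 45 lies more than 1.5 box lengths above the upper quartile, so it would be drawn as an individual point, and the upper fence is the next largest observation.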

[Figure: number of discharges as a percentage of the total number of patients, by day of week.]
Wong HJ et al., Real-time operational feedback: daily discharge rate as a novel hospital efficiency metric, Qual Saf Health Care 2010;19:1-5, doi:10.1136/qshc.2010.040832

Bar charts
1. Representation of the distribution of a qualitative variable or of a quantitative variable with few values (e.g., parity of a woman). Each value of the variable is assigned a bar, whose height equals the absolute or relative frequency of the value.
2. Representation of group statistics (e.g., group means of the outcome variable) or of statistics of complex observational units (e.g., regions, hospitals, etc.).

Bar charts representing the distribution of a categorical variable
- Bars represent different categories (or levels) of the respective categorical variable.
- Heights of bars are proportional to the relative frequencies of the associated categories.
[Figure: two bar charts of relative frequency (%) for categories A, B, C and D in Group 1 and Group 2; one with the categories on the x-axis and the groups in the legend, one with the groups on the x-axis and the categories in the legend.]

Representation of group means by bar charts
Here, bars represent group means and the error intervals are mean ± 1 standard error (68% confidence interval). 95% confidence intervals (mean ± 2 standard errors) would be better.
Smith HAB et al., Nitric oxide precursors and congenital heart surgery: A randomized controlled trial of oral citrulline, J Thorac Cardiovasc Surg 2006; 132:58-65
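The two kinds of error interval mentioned here can be computed from a sample as sketched below. The function name and the data are hypothetical, and the factor 1.96 assumes the normal approximation; with small samples, a t-quantile would give a somewhat wider interval.

```python
import math
import statistics


def mean_with_intervals(values):
    """Mean, standard error, mean +/- 1 SE and approximate 95% CI."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    return {
        "mean": mean,
        "se": se,
        "mean_pm_1se": (mean - se, mean + se),
        "ci95": (mean - 1.96 * se, mean + 1.96 * se),
    }


# Hypothetical measurements, for illustration only.
result = mean_with_intervals([4, 6, 8, 10, 12])
print(result)
```

The 95% interval is roughly twice as wide as the mean ± 1 SE interval, which is why error bars of ± 1 SE can make group differences look more convincing than they are.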

Scatter plots
Scatter plots serve to visualize the association between two numerical variables (here: z-scores of upper and lower extremity latencies in RRMS and SPMS patients).
[Figure: scatter plot of the z-score of lower extremity latency (y-axis) against the z-score of upper extremity latency (x-axis).]

Comparison measures
a) for quantitative data
b) for binary data
c) for count data

Comparison measures for quantitative variables

Differences in means
Application: comparison of different groups with respect to
a) the outcome of interest at follow-up and / or
b) the change in the outcome variable during follow-up.
Example: effect of vitamin C on serum uric acid level.
Comparison measure: difference between the mean change in serum uric acid level in the treatment group (vitamin C supplementation) and the mean change in serum uric acid level in the placebo group.

Huang HY et al., The Effects of Vitamin C Supplementation on Serum Concentrations of Uric Acid: Results of a Randomized Controlled Trial, Arthritis Rheum. 2005; 52:1843-7.

Remarks
- The difference in the mean of an outcome variable between two independent samples is generally assessed using the t-test (validity condition: approximate normality and similar variability of the data in both groups, or sufficiently large sample sizes).
- If data have a skewed distribution (e.g., lab measurements), approximate normality may often be achieved by a logarithmic transformation of the data (cf. next topic).
- But a data transformation is not always appropriate, e.g., if mean costs are to be compared. In this case, bootstrap methods or permutation tests may help to achieve valid statistical comparisons.
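As an illustration of the last remark, a permutation test for a difference in means can be sketched in a few lines. This is one possible implementation under stated assumptions, not the method used in any of the cited studies: group labels are reshuffled at random, the p-value is two-sided, and the cost data, function name and default number of permutations are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test_mean_diff(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random reassignment of group labels
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        # Small tolerance guards against floating-point rounding.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Adding 1 to numerator and denominator avoids a p-value of exactly 0.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical, strongly skewed cost data (illustration only).
costs_control = [120, 95, 130, 110, 2400, 105, 98, 140]
costs_treated = [150, 160, 3100, 170, 145, 180, 2900, 155]
print(permutation_test_mean_diff(costs_control, costs_treated))
```

The test compares means on the original scale, so the result still refers to mean costs rather than to a transformed quantity.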

Geometric mean ratios
In many cases, the original outcome has a skewed distribution but becomes approximately normal on a logarithmic scale. In this case, the data should first be log-transformed. Then the group means of the log-transformed data should be compared.
Example: neurofilament heavy chain (NfH) protein in cerebrospinal fluid across healthy controls and different groups of MS patients.

[Figure: boxplots of NfH-protein concentration (0 to 200) and of ln(NfH-protein concentration) (2.5 to 5) for controls, CIS, PPMS, SPMS and RRMS.]

Group      Median   Mean of ln(NfH)   Geometric mean
Controls   27.1     3.30              exp(3.30) = 27.1
CIS        32.9     3.48              exp(3.48) = 32.5
PPMS       47.8     3.97              exp(3.97) = 53.0
SPMS       51.2     3.83              exp(3.83) = 46.1
RRMS       43.4     3.84              exp(3.84) = 46.5

QQ-plots (of ln(NfH))
[Figure: normal QQ-plots of ln(NfH) (y-axis, 2.5 to 5) against the inverse normal (x-axis), one panel each for HC, CIS, PPMS, RRMS and SPMS.]
If points are close to a straight line, the distribution can be considered as approximately normal.

Geometric mean: mathematical definition
Let mean(ln(X)) denote the sample mean of the log-transformed variable ln(X). After back-exponentiation, this mean turns into the so-called geometric mean of X:

$\text{geometric mean of } X = e^{\operatorname{mean}(\ln(X))}$ (*)

If the distribution of ln(X) is approximately symmetrical, then the geometric mean of X is a good approximation of the median of X.
(*) $e^{u} = \exp(u)$ is Euler's exponential function (e = 2.71828... is Euler's number).
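The definition translates directly into code. The sketch below uses only the standard library; the function name and the sample values are hypothetical. (Python 3.8 and later also ship `statistics.geometric_mean`, which can be used instead.)

```python
import math
import statistics


def geometric_mean(values):
    """Back-exponentiated mean of the log-transformed values (values must be > 0)."""
    log_mean = statistics.fmean(math.log(x) for x in values)
    return math.exp(log_mean)


# Hypothetical right-skewed sample, for illustration only.
sample = [12, 18, 25, 27, 31, 44, 160]
print("arithmetic mean:", statistics.fmean(sample))
print("geometric mean: ", geometric_mean(sample))
print("median:         ", statistics.median(sample))
```

For skewed data like this, the single large value pulls the arithmetic mean upwards, while the geometric mean stays much closer to the median.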

Geometric mean ratios
Let
$\operatorname{mean}_1(\ln(X))$ = mean of ln(X) in sample 1,
$\operatorname{mean}_2(\ln(X))$ = mean of ln(X) in sample 2.
Then, after back-exponentiation, the difference

$\Delta\text{mean} = \operatorname{mean}_2(\ln(X)) - \operatorname{mean}_1(\ln(X))$

turns into the so-called geometric mean ratio between the two samples:

$e^{\Delta\text{mean}} = e^{\operatorname{mean}_2(\ln(X)) - \operatorname{mean}_1(\ln(X))} = \dfrac{e^{\operatorname{mean}_2(\ln(X))}}{e^{\operatorname{mean}_1(\ln(X))}} = \dfrac{GM_2(X)}{GM_1(X)}$

In many cases, this ratio is close to the ratio of medians.

Geometric mean ratios

Group      Mean (log scale)   Geometric mean     Geometric mean ratio   Mean difference (log scale)
Controls   3.30               exp(3.30) = 27.1   1                      0
CIS        3.48               exp(3.48) = 32.5   32.5 / 27.1 = 1.20     0.18
PPMS       3.97               exp(3.97) = 53.0   53.0 / 27.1 = 1.96     0.67
SPMS       3.83               exp(3.83) = 46.1   46.1 / 27.1 = 1.70     0.53
RRMS       3.84               exp(3.84) = 46.5   46.5 / 27.1 = 1.72     0.54

The geometric mean ratio equals exp(x), where x is the mean difference on the log scale.

Digression:
95% confidence limits of geometric means: $\exp\left[\text{mean}_{\log} \pm 1.96 \cdot SE(\text{mean}_{\log})\right]$
95% confidence limits of geometric mean ratios: $\exp\left[\Delta\text{mean}_{\log} \pm 1.96 \cdot SE(\Delta\text{mean}_{\log})\right]$
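The ratios in the table and the confidence limits from the digression can be reproduced as sketched below. The log-scale means are taken from the table; the standard error passed to `gmr_confidence_interval` is a hypothetical value, since the slides do not report standard errors, and both function names are introduced here for illustration.

```python
import math


def geometric_mean_ratio(log_mean_reference, log_mean_group):
    """Back-exponentiated difference of two log-scale means."""
    return math.exp(log_mean_group - log_mean_reference)


def gmr_confidence_interval(delta_log_mean, se_delta, z=1.96):
    """Approximate 95% limits: exp(delta +/- z * SE(delta))."""
    return (math.exp(delta_log_mean - z * se_delta),
            math.exp(delta_log_mean + z * se_delta))


log_means = {"Controls": 3.30, "CIS": 3.48, "PPMS": 3.97, "SPMS": 3.83, "RRMS": 3.84}
for group, log_mean in log_means.items():
    ratio = geometric_mean_ratio(log_means["Controls"], log_mean)
    print(f"{group:9s} geometric mean ratio vs. controls: {ratio:.2f}")

# Hypothetical standard error of the log-scale difference (not from the slides).
print(gmr_confidence_interval(0.18, se_delta=0.10))
```

Because the log-scale means are rounded to two decimals, exponentiating the difference can differ in the last digit from dividing the rounded geometric means (for PPMS, exp(0.67) is about 1.95, whereas 53.0 / 27.1 is about 1.96).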

Comparison measures for binary variables

Binary outcome variables
X1 = "Treatment was effective in patient P"
  X1 = 1 if P was successfully treated,
  X1 = 0 if the result of the treatment in patient P did not meet expectations
X2 = "Subject P developed cancer during follow-up"
  X2 = 1 if this happened with P,
  X2 = 0 if P did not develop cancer during follow-up
X3 = "Patient P was satisfied with treatment"
  X3 = 1 if P expressed satisfaction,
  X3 = 0 if P was not satisfied

Comparison measures for binary outcome variables
A) Frequency or risk difference (RD): difference in risks (relative frequencies) between the two groups
B) Relative risk (RR): ratio of risks (relative frequencies) between the two groups
C) Odds ratio (OR): ratio of odds* between the two groups
* Odds = risk / (1 - risk)

Risk and odds (examples)

Risk        Odds
0.1 (10%)   10 / 90 = 0.11
0.2 (20%)   20 / 80 = 0.25
0.5 (50%)   50 / 50 = 1.0
0.6 (60%)   60 / 40 = 1.5
0.8 (80%)   80 / 20 = 4.0

For risks < 10%, odds and risks are essentially the same.

These comparison measures can be computed directly from the underlying 2 by 2 table:

            with outcome   without outcome   total
exposed*    64 (80%)       16 (20%)           80
unexposed   72 (60%)       48 (40%)          120
total       136            64                200

RD = 64/80 - 72/120 = (96 - 72)/120 = 0.2
OR = (64/16) : (72/48) = (64 × 48) / (16 × 72) = 2.67
RR = (64/80) : (72/120) = (64 × 120) / (72 × 80) = 1.33

* "Exposed" can also stand for a specific treatment, in which case subjects with the control treatment are said to be unexposed.
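The three calculations can be wrapped in one small function. This is a minimal sketch; the function name and the cell labels a, b, c, d are conventions chosen here, not notation from the slides.

```python
def two_by_two_measures(a, b, c, d):
    """Comparison measures from a 2 by 2 table.

    a = exposed with outcome,    b = exposed without outcome,
    c = unexposed with outcome,  d = unexposed without outcome.
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return {
        "RD": risk_exposed - risk_unexposed,   # risk difference
        "RR": risk_exposed / risk_unexposed,   # relative risk
        "OR": (a * d) / (b * c),               # odds ratio (cross-product)
    }


measures = two_by_two_measures(a=64, b=16, c=72, d=48)
for name, value in measures.items():
    print(f"{name} = {value:.2f}")
```

With the counts from the table, this reproduces RD = 0.20, RR = 1.33 and OR = 2.67.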

                       Intervention group (n = 80)   Control group (n = 120)   Comparison measure     p-value
                       (95% conf. interval)          (95% conf. interval)      (95% conf. interval)

Risk difference
Successful treatment   80% (71%, 89%)                60% (49%, 71%)            20% (8%, 32%)          0.003
Satisfied patients     90% (83%, 97%)                80% (71%, 89%)            10% (<0%, 20%)         0.06

Relative risk
Successful treatment   80% (71%, 89%)                60% (49%, 71%)            1.33 (1.11, 1.60)      0.003
Satisfied patients     90% (83%, 97%)                80% (71%, 89%)            1.13 (<1.00, 1.26)     0.06

Odds ratio
Successful treatment   80% (71%, 89%)                60% (49%, 71%)            2.67 (1.38, 5.15)      0.003
Satisfied patients     90% (83%, 97%)                80% (71%, 89%)            2.25 (0.96, 5.30)      0.06
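The slides do not state how the confidence intervals of the comparison measures were obtained. One common choice is the large-sample (Wald) interval, computed on the original scale for the risk difference and on the log scale for the relative risk and the odds ratio. The sketch below applies these standard formulas as an assumption; the function name is hypothetical. For the "successful treatment" counts (64 of 80 versus 72 of 120), the resulting limits round to those shown for the comparison measures in the table.

```python
import math


def wald_intervals(a, b, c, d, z=1.96):
    """Point estimates with approximate 95% Wald limits.

    a, b = group 1 with / without outcome; c, d = group 0 with / without outcome.
    Returns {measure: (estimate, lower, upper)}.
    """
    n1, n0 = a + b, c + d
    p1, p0 = a / n1, c / n0

    rd = p1 - p0
    se_rd = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)

    rr = p1 / p0
    se_log_rr = math.sqrt(1 / a - 1 / n1 + 1 / c - 1 / n0)

    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    return {
        "RD": (rd, rd - z * se_rd, rd + z * se_rd),
        "RR": (rr, rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr)),
        "OR": (odds_ratio,
               odds_ratio * math.exp(-z * se_log_or),
               odds_ratio * math.exp(z * se_log_or)),
    }


for name, (est, lower, upper) in wald_intervals(64, 16, 72, 48).items():
    print(f"{name}: {est:.2f} ({lower:.2f}, {upper:.2f})")
```

Other interval methods (e.g., exact or score-based) give slightly different limits, especially when an interval lies close to the null value.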

Why odds ratios?
Odds ratios are commonly used to describe associations between binary outcomes and predictor variables because:
a) Unlike the relative risk, the odds ratio is a meaningful measure not only in cohort studies but also in case-control studies.
b) Logistic regression models provide effect estimates in the form of odds ratios.

How to interpret odds ratios?
There are 3 possibilities:
a) 1 < RR < OR
b) OR < RR < 1
c) RR = 1 = OR
Unless both equal 1, odds ratios are always farther away from 1 than the corresponding relative risks.
With low risks (i.e., risks < 10%), odds ratios may be interpreted as relative risks.
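The relation between the two measures can be made concrete by holding the relative risk fixed and varying the baseline risk. The sketch below is purely illustrative: the function names and the chosen risks are hypothetical.

```python
def risk_to_odds(risk):
    """Odds = risk / (1 - risk)."""
    return risk / (1 - risk)


def odds_ratio_from_risks(risk_1, risk_0):
    """Odds ratio of group 1 versus group 0, given the two risks."""
    return risk_to_odds(risk_1) / risk_to_odds(risk_0)


# Hypothetical scenario: the relative risk is always 2, the baseline risk varies.
relative_risk = 2.0
for baseline_risk in (0.01, 0.05, 0.20, 0.40):
    odds_ratio = odds_ratio_from_risks(relative_risk * baseline_risk, baseline_risk)
    print(f"baseline risk {baseline_risk:.0%}: RR = {relative_risk:.2f}, OR = {odds_ratio:.2f}")
```

At a baseline risk of 1% the odds ratio is almost identical to the relative risk, whereas at 40% it is three times as large, so reading an odds ratio as a relative risk is only reasonable when risks are low.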

Comparison measures for count data

Count variables
Examples:
- Number of doctor's visits of a patient during a certain time period.
- Number of deaths within a specific region during a certain time period.
- Number of children with epilepsy manifesting in the first 5 years of life in Denmark, 1979-2002.

Incidence rate
If observational units are individual persons:
  IR = number of events / length of the observation period
If observational units are populations:
  IR = number of events / person-time observed
Example: IR of epilepsy in the first 5 years of life in Denmark:
  low birth weight: 361 / 272318 person-years = 179 / 10^5 person-years
  normal birth weight: 1342 / 1513527 person-years = 89 / 10^5 person-years
Sun et al., Gestational Age, Birth Weight, Intrauterine Growth and Risk of Epilepsy, Am J Epidemiol 2007; 167: 262-70

Incidence rate
If the event is unique (e.g., death), then the period of observation of a person with this event equals the time between the beginning of the observation period and the event.
[Figure: timeline of the observation period showing an observation ending with an event, an incomplete observation without event, and a complete observation without event.]

Incidence rate ratio
IRR = IR in group 2 / IR in group 1 (= 179 / 89 = 2.01)
95% confidence interval (approximate)*:

$\mathrm{IRR} \cdot e^{\pm 1.96 \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ $\left( = 2.01 \cdot e^{\pm 1.96 \sqrt{\frac{1}{361} + \frac{1}{1342}}} = (1.71, 2.37) \right)$

$n_1$ = number of events in group 1
$n_2$ = number of events in group 2
* holds if $n_1$ and $n_2$ have a Poisson distribution
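The formula can be applied as sketched below. The event counts and person-years in the usage example are hypothetical round numbers chosen for illustration (they are not the Danish data), and the function name is an assumption.

```python
import math


def incidence_rate_ratio(events_1, person_time_1, events_2, person_time_2, z=1.96):
    """IRR of group 2 versus group 1 with approximate 95% limits.

    Uses IRR * exp(+/- z * sqrt(1/n1 + 1/n2)), which assumes that the
    event counts n1 and n2 follow Poisson distributions.
    """
    rate_1 = events_1 / person_time_1
    rate_2 = events_2 / person_time_2
    irr = rate_2 / rate_1
    factor = math.exp(z * math.sqrt(1 / events_1 + 1 / events_2))
    return irr, irr / factor, irr * factor


# Hypothetical example: 50 events in 10,000 person-years (group 1)
# versus 100 events in 10,000 person-years (group 2).
irr, lower, upper = incidence_rate_ratio(50, 10_000, 100, 10_000)
print(f"IRR = {irr:.2f} ({lower:.2f}, {upper:.2f})")
```

The width of the interval depends only on the numbers of events, not on the person-time, so rare events lead to wide intervals even in very large populations.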

Adjusted and unadjusted comparison measures
In observational studies, but also in randomised trials with a remaining imbalance of certain factors, differences between groups may be confounded.
E.g., the difference in mean blood pressure between normal-weight and overweight persons is confounded by age (since both weight and blood pressure tend to increase with age). Without adjustment for the influence of age, the effect of overweight on blood pressure is therefore overestimated.
There are different statistical methods by which comparison measures can be freed from such confounding influences:
-> stratification, standardization, regression models

Non-parametric comparison measures

Receiver operating characteristic (ROC) curve
[Figure: ROC curve with sensitivity (true positive rate) on the y-axis and 1 - specificity (false positive rate) on the x-axis; AUC = 0.83.]
Outcome: worsening of EDSS score by > 0.5 units over 14 years
Predictor: score involving z-values of latencies from eyes and upper extremities at baseline
AUC = area under the curve

Area under the ROC curve
The ROC curve of X as a predictor of membership in population 2 (as opposed to population 1) has the property

AUC = (proportion of pairs (x_1, x_2), with x_1 from group 1 and x_2 from group 2, satisfying x_2 > x_1)
      + 0.5 × (proportion of such pairs satisfying x_2 = x_1)

This is an estimate of the probability that a randomly selected member of population 2 will have a higher value of X than a randomly selected member of population 1.

- AUC > 0.5: values of X tend to be higher in group 2 than in group 1
- AUC = 0.5: X does not discriminate between the two groups
- AUC < 0.5: values of X tend to be lower in group 2 than in group 1
AUC can also be applied to ordinal variables and provides a natural way of comparing such variables.
Moreover, AUC has a direct link to the Wilcoxon rank sum test: a significant result of the Wilcoxon rank sum test is equivalent to a significant difference between AUC and 0.5.
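The pairwise definition of the AUC can be implemented directly by comparing every value in group 1 with every value in group 2. This is a minimal sketch for small samples; the function name and the scores are hypothetical, and the double loop is chosen for clarity rather than speed.

```python
def auc_by_pairs(group_1, group_2):
    """Proportion of pairs with x2 > x1, plus half the proportion of ties."""
    higher = 0
    ties = 0
    for x1 in group_1:
        for x2 in group_2:
            if x2 > x1:
                higher += 1
            elif x2 == x1:
                ties += 1
    n_pairs = len(group_1) * len(group_2)
    return (higher + 0.5 * ties) / n_pairs


# Hypothetical ordinal scores, for illustration only.
scores_group_1 = [1, 2, 2, 3, 3, 4]
scores_group_2 = [2, 3, 4, 4, 5, 5]
print(f"AUC = {auc_by_pairs(scores_group_1, scores_group_2):.2f}")
```

The numerator, higher + 0.5 × ties, is the Mann-Whitney U statistic for group 2, which is the link to the Wilcoxon rank sum test: AUC = U / (n_1 × n_2).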

Summary: tabular and graphical representations of distributions
Basic rule: all such representations should be self-explanatory.
Tables:
- categorical variables: relative (%) and absolute frequencies (n)
- numerical variables: mean ± SD (if normally distributed); median with quartiles or min / max (otherwise)
Figures:
- boxplots for numerical variables
- bar charts for categorical variables
- scatter plots to display the association between two numerical variables
- (normal probability plot for visual assessment of the degree of normality of the data distribution)

Summary: comparison measures
Numerical variables:
- difference in means (data normally distributed or no other measure wanted)
- geometric mean ratio (data have a log-normal distribution)
Binary variables:
- risk difference (or frequency difference)
- relative risk
- odds ratio
Count data:
- incidence rate ratio
Numerical and ordinal data:
- area under the ROC curve
All comparison measures should always be reported with 95% confidence intervals!

Thank you for your attention!