P (B) In statistics, the Bayes theorem is often used in the following way: P (Data Unknown)P (Unknown) P (Data)



Similar documents
Bayes Theorem & Diagnostic Tests Screening Tests

Case-control studies. Alfredo Morabia

Chapter 7: Effect Modification

Bayesian Tutorial (Sheet Updated 20 March)

Use of the Chi-Square Statistic. Marie Diener-West, PhD Johns Hopkins University

RATIOS, PROPORTIONS, PERCENTAGES, AND RATES

3.2 Conditional Probability and Independent Events

C. The null hypothesis is not rejected when the alternative hypothesis is true. A. population parameters.

Measures of Prognosis. Sukon Kanchanaraksa, PhD Johns Hopkins University

STA 371G: Statistics and Modeling

Confounding in Epidemiology

AP: LAB 8: THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Lesson Outline Outline

The Young Epidemiology Scholars Program (YES) is supported by The Robert Wood Johnson Foundation and administered by the College Board.

Comparison of frequentist and Bayesian inference. Class 20, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Chapter 4. Probability and Probability Distributions

"Statistical methods are objective methods by which group trends are abstracted from observations on many separate individuals." 1

GENETICS AND INSURANCE: QUANTIFYING THE IMPACT OF GENETIC INFORMATION

Summary Measures (Ratio, Proportion, Rate) Marie Diener-West, PhD Johns Hopkins University

Estimation of the Number of Lung Cancer Cases Attributable to Asbestos Exposure

This fact sheet describes how genes affect our health when they follow a well understood pattern of genetic inheritance known as autosomal recessive.

Mendelian inheritance and the

HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as...

Exercise Answers. Exercise B 2. C 3. A 4. B 5. A

The Importance of Statistics Education

Prospective, retrospective, and cross-sectional studies

HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as...

Changing Paradigms for evaluating costs and benefits of drug treatments. Deven Chauhan Health Economist Office of Health Economics London, UK

CHAPTER THREE. Key Concepts

Bayesian Updating with Discrete Priors Class 11, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Knowledge Discovery and Data Mining

Case-Control Studies. Sukon Kanchanaraksa, PhD Johns Hopkins University

Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Chapter 3. Sampling. Sampling Methods

E3: PROBABILITY AND STATISTICS lecture notes

Course Notes Frequency and Effect Measures

Types of Studies. Systematic Reviews and Meta-Analyses

Week 4: Standard Error and Confidence Intervals

Measures of diagnostic accuracy: basic definitions

Basic Study Designs in Analytical Epidemiology For Observational Studies

The International EMF Collaborative s Counter-View of the Interphone Study

II. DISTRIBUTIONS distribution normal distribution. standard scores

6.3 Conditional Probability and Independence

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Laboratory 3 Type I, II Error, Sample Size, Statistical Power

Childhood leukemia and EMF

Population Genetics and Multifactorial Inheritance 2002

Lecture 9: Bayesian hypothesis testing

Bayesian Model Averaging Continual Reassessment Method BMA-CRM. Guosheng Yin and Ying Yuan. August 26, 2009

AUSTRALIAN VIETNAM VETERANS Mortality and Cancer Incidence Studies. Overarching Executive Summary

Guide to Biostatistics

LAB : THE CHI-SQUARE TEST. Probability, Random Chance, and Genetics

Interpretation of Somers D under four simple models

Introduction to Statistics and Quantitative Research Methods

Topic 8. Chi Square Tests

Mind on Statistics. Chapter 4

Confounding in health research

Introduction to Hypothesis Testing OPRE 6301

Lung cancer and asbestos

SENSITIVITY ANALYSIS AND INFERENCE. Lecture 12

Refugees with diabetes mellitus have higher prevalence of latent tuberculosis infection

*6816* 6816 CONSENT FOR DECEASED KIDNEY DONOR ORGAN OPTIONS

Discrete Structures for Computer Science

Biostatistics and Epidemiology within the Paradigm of Public Health. Sukon Kanchanaraksa, PhD Marie Diener-West, PhD Johns Hopkins University

Inclusion and Exclusion Criteria

JESSE HUANG ( 黄 建 始 ),MD,MHPE,MPH,MBA Professor of Epidemiology Assistant President

EMPIRICAL FREQUENCY DISTRIBUTION

Descriptive Methods Ch. 6 and 7

Genetics Lecture Notes Lectures 1 2

Risk Factors for Alcoholism among Taiwanese Aborigines

Methods for Meta-analysis in Medical Research

Public Health Education

Effect measure modification & Interaction. Madhukar Pai, MD, PhD McGill University madhukar.pai@mcgill.ca

Heredity. Sarah crosses a homozygous white flower and a homozygous purple flower. The cross results in all purple flowers.

Genetic Testing, Insurance Underwriting and Adverse Selection

Overview of study designs

1 Introductory Comments. 2 Bayesian Probability

Cancer Biostatistics Workshop Science of Doing Science - Biostatistics

MEDICAL REPORT MEDICAL HISTORY QUESTIONS

CURVE FITTING LEAST SQUARES APPROXIMATION

Chi-square test Fisher s Exact test

Arkansas Tech University MATH 4033: Elementary Modern Algebra Dr. Marcel B. Finan

Genetic Mutations. Indicator 4.8: Compare the consequences of mutations in body cells with those in gametes.

5.1 Identifying the Target Parameter

5. EPIDEMIOLOGICAL STUDIES

3. Mathematical Induction

Section 12 Part 2. Chi-square test

Transcription:

22S:101 Biostatistics: J. Huang 1 Bayes Theorem For two events A and B, if we know the conditional probability P (B A) and the probability P (A), then the Bayes theorem tells that we can compute the conditional probability P (A B) as follows: P (A B) = P (B A)P (A). P (B) In statistics, the Bayes theorem is often used in the following way: P (Unknown Data) = P (Data Unknown)P (Unknown) P (Data)

22S:101 Biostatistics: J. Huang 2 Example: Accuracy of X-rays (continued) Table 1: X-ray reading data Persons without TB Persons with TB Total + X-ray 51 22 73 X-ray 1739 8 1747 Total 1790 30 1820 D 1 : tuberculosis D 2 : no tuberculosis T + : positive X-ray T : negative X-ray It is useful to find P (D 1 T + ), the probability that an individual has the disease given that he tests positive. This probability is also called the predictive value of a positive test. Remark: Here because of the problem of sampling bias, it is not correct to simply estimate P (D 1 T + ) based on the observed data, i.e., the numbers given in the table. This incorrect estimate gives 22/73, which has a large upward bias and over estimates P (D 1 T + ).

22S:101 Biostatistics: J. Huang 3 We use the Bayes theorem to find P (D 1 T + ). The Bayes theorem says: P (D 1 T + ) = P (T + D 1 )P (D 1 ). P (T + ) There are two importants things here: 1. Prior probability: P (D 1 ): the probability of TB before having the data. This is called prior probability. Usually, a judgement call has to be made as to what prior probability to use. For the present problem, it seems reasonable to use the population prevalence as the prior probability. In 1987, there were 9.3 TB cases per 100,000 population. Therefore, we specify: P (D 1 ) = 9.3 100, 000 = 0.000093. 2. Probability of the data given the model (the likelihood): P (T + D 1 ): the probability of test positive given the disease. This summarizes the information in the data, i.e., the test results in our problem. Here P (T + D 1 ) = 22 30 = 0.7333333 Then operationally, the Bayesian method is to make inference based on the posterior probability, calculated from the prior probability and the probability of the data given the model (likelihood) using the Bayes theorem.

22S:101 Biostatistics: J. Huang 4 We can calculate P (T + ) as follows: P (T + ) = P (T + D 1 ) + P (T + D 1 ) Plug in the numbers, we get P (T + ) = 22 30 Therefore, We note that = P (T + D 1 )P (D 1 ) + P (T + D 1 )P (D 1 ). 0.000093 + 51 1790 P (D 1 T + ) = (1 0.000093) = 0.02855717. 0.7333333 0.000093 0.02855717 0.00239. P (D 1 T + ) P (D 1 ) = 0.00239 0.000093 = 25.7. So the probability that an individual with a positive X-ray has TB is about 26 times greater than the probability for an individual randomly chosen from the population.

22S:101 Biostatistics: J. Huang 5 The relative risk and the odds ratio Relative risk RR = P (disease exposed) P (disease unexposed). Here the exposure can be either environmental or genetic (or both). P (death over 35 due to lung cancer male smoker) = 0.002679 P (death over 35 due to lung cancer male nonsmoker) = 0.000154 RR = 0.002679 0.000154 = 17.4.

22S:101 Biostatistics: J. Huang 6 In genetic epidemiology, the genotypic relative risk (GRR) is an important quantity to measure the genetic contribution to a disease. The GRR also gives indication on how difficult it is to identify the chromosomal regions that may harbor the disease-predisposing genes. Suppose the genotypes at a disease-predisposing locus are AA, Aa and aa, where the allele A increases the risk of disease. Then GRR 1 = P (disease Aa) P (disease aa) GRR 2 = P (disease AA) P (disease aa) For the Mendelian diseases (e.g., Cystic Fibrosis, Huntington s disease), the GGR is very high. However, for many common diseases, although there usually is a strong genetic component that contributes to elevating the risk of the disease, the GRR is often relatively low. This makes it difficult to map the genes that predispose the disease (e.g. alcoholism, autism, bipolar).

22S:101 Biostatistics: J. Huang 7 Odds ratio Odds Let p 1 = P (disease exposed). So p 1 is the probability that an individual has the disease given that he/she is exposed to a certain risk factor. The odds of having the disease given exposure to the risk factor is odds.disease(exposed) = p 1 1 p 1. Likewise, let p 2 be the probability that an individual has the disease given that he/she is not exposed to a certain risk factor. The odds of having the disease given no exposure to the risk factor is odds.disease(unexposed) = p 2 1 p 2.

22S:101 Biostatistics: J. Huang 8 Odds ratio OR = odds.disease(exposed) odds.disease(unexposed) P (disease exposed)/[1 P (disease exposed)] = P (disease unexposed)/[1 P (disease unexposed)]. The OR can also be defined as the ratio of the odds of exposure among diseased individuals and the odds of exposure among nondiseased individuals. OR = P (exposure diseased)/[1 P (exposure) diseased)] P (exposure nondiseased)/[1 P (exposure nondiseased)]. The above two definitions are mathematically equivalent. The second definition is useful in case-control studies.

22S:101 Biostatistics: J. Huang 9 Proof of the equivalence of the two expressions for OR Let D = {diseased}, D = {nondiseased} and E = {exposed}, Ē = {unexposed}. Because 1 P (disease exposed) = 1 P (D E) = P ( D E) 1 P (disease unexposed) = 1 P (D Ē) = P ( D Ē) 1 P (exposure) diseased) = 1 P (E D) = P (Ē D) 1 P (exposure nondiseased) = 1 P (E D) = P (Ē D), what we need to prove is P (D E)/P ( D E) P (E D)/P (Ē D) =. P (D Ē)/P (D Ē) P (E D)/P (Ē D) By the Bayes theorem, the left-hand side equals P (D E)/P ( D E) = P (D Ē)/P (D Ē) P (E D)P (D) P (E) / P (E D)P ( D) P (E) P (Ē D)P (D) P (Ē D)P (D) P (Ē) / P (Ē) P (E D)/P (E D) = P (Ē D)/P (Ē D) P (E D)/P (Ē D) = P (E D)/P (Ē D) = Right-hand side. This proves that the two expressions for OR are equal.

22S:101 Biostatistics: J. Huang 10 Example: In a case control study, investigators start by identifying individuals with the disease (the cases) and without the disease (the controls). They then go back in time to determine whether the exposure is question was present or absent for each individual. In a study that examines the effects of the use of oral contraceptives on the breast cancer, among the 989 women who had breast cancer, 273 had previously used oral contraceptives and 716 had not. Of the 9901 women who did not have breast cancer, 2641 had used oral contraceptives and 7269 had not. In such a study, the proportions of individuals with and without the disease are chosen by the investigator, therefore, the probability of disease in the exposed and unexposed groups cannot be estimated. However, we can estimate the probability of exposure for both cases and controls. Thus by the second form of the OR, we can calculate the OR of the disease of exposed verses unexposed.

22S:101 Biostatistics: J. Huang 11 Table 2: Breast cancer example Breast cancer Oral contraceptive Yes No Total Yes 273 2641 2914 No 716 7260 7976 Total 989 9901 10890 (273/989)/(1 273/989) OR = (2641/9901)/(1 2641/9901) = (273/989)/(716/989) (2641/9901)/(7260/9901) = 273/716 2641/7260 273 7260 = 716 2641 = 1.05.

22S:101 Biostatistics: J. Huang 12 Relationship between OR and RR If P (disease exposed) 0 and P (disease unexposed) 0, then OR = odds.disease(exposed) odds.disease(unexposed) P (disease exposed) P (disease unexposed) = RR.

22S:101 Biostatistics: J. Huang 13 Receive Operator Characteristic (ROC) Curves Sensitivity: the probability of a positive test given that it is present Specificity: the probability of a negative test given that it is not present The purpose of the ROC analysis is to find the trade-off (usually a threshold value) so that the levels of sensitivity and specificity are acceptable.

22S:101 Biostatistics: J. Huang 14 Table 3: Sensitivity and specificity of serum creatinine level for predicting transplant rejection Serum Creatinine (mg %) Sensitivity Specificity 1.2 0.939 0.13 1.3 0.939 0.203 1.4 0.909 0.281 1.5 0.818 0.380 1.6 0.758 0.461 1.7 0.727 0.535 1.8 0.636 0.649 1.9 0.636 0.711 2.0 0.545 0.766 2.1 0.485 0.773 2.2 0.485 0.803 2.3 0.394 0.811 2.4 0.394 0.843 2.5 0.363 0.870 2.6 0.333 0.891 2.7 0.333 0.894 2.8 0.333 0.896 2.9 0.303 0.909

22S:101 Biostatistics: J. Huang 15 ROC curve for serum creatinine level as predictor of transplant rejection Sensitivity 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1-Specificity