Likelihood Approaches for Trial Designs in Early Phase Oncology

Likelihood Approaches for Trial Designs in Early Phase Oncology Clinical Trials
Elizabeth Garrett-Mayer, PhD, and Cody Chiuzan, PhD
Hollings Cancer Center, Department of Public Health Sciences, Medical University of South Carolina
April 24, 2014

Practicalities of Early Phase Oncology Trials
- Small sample sizes
- Conservation of resources: patients, time, funding
- Ethics: early stopping, protection of (sick) patients
- Recent emphasis on adaptive trial designs
- Bayesian? Controversies over priors
- CRM: both Bayesian and likelihood implementations

Common Problems with Oncology Clinical Trials
1. No early stopping for futility
2. No safety stopping rules in efficacy trials
3. Poor characterization of properties of phase I design

Today's Talk: Simple Likelihood Approaches for Addressing Common Problems
- The likelihood approach: principles, multiple looks
- The Sequential Probability Ratio Test (SPRT)
- Phase I designs
- Phase II designs

The Likelihood or Evidential Paradigm
Based on the Law of Likelihood: If hypothesis A implies that the probability of observing some data X is $P_A(X)$, and hypothesis B implies that the probability is $P_B(X)$, then the observation X = x is evidence supporting A over B if $P_A(x) > P_B(x)$, and the likelihood ratio, $P_A(x)/P_B(x)$, measures the strength of that evidence. (Hacking (1965); Royall (1997))

Likelihood Approach
- Determine what the data say about the parameter of interest
- Likelihood function: gives a picture of the data
- Likelihood interval (LI): gives a range of reasonable values for the parameter of interest

Likelihood Ratios
- Take the ratio of the heights of L for different values of λ
- L(λ = 0.030) = 0.78; L(λ = 0.035) = 0.03
- LR = 0.78/0.03 = 26

Key difference in likelihood vs. frequentist paradigm
- Consideration of the alternative hypothesis
- Frequentist hypothesis testing
  - $H_0$: null hypothesis
  - $H_1$: alternative hypothesis
- Frequentist p-values
  - Calculated assuming the null is true
  - Ignore the alternative hypothesis
- Likelihood ratio
  - Compares evidence for the two specified hypotheses
  - Acceptance or rejection of the null depends on the alternative

Example
- Assume $H_0$: λ = 0.12 vs. $H_1$: λ = 0.08
- What if true λ = 0.10?
- Simulated data, N = 300
- Frequentist: p = 0.01, reject the null
- Likelihood: LR = 1/4, weak evidence in favor of the null

Example: Why?
- The p-value looks for evidence against the null
- The LR compares evidence for both hypotheses
- When the truth is in the middle, which makes more sense?

Likelihood Inference
- Three possible results: strong evidence for $H_0$, strong evidence for $H_1$, or weak evidence
- Weak evidence: at the end of the study, there is not sufficiently strong evidence in favor of either hypothesis
  - This can be controlled by choosing a large enough sample size
  - But if neither hypothesis is correct, we can end up with weak evidence even if N is seemingly large
- Strong evidence
  - Correct evidence: strong evidence in favor of the correct hypothesis
  - Misleading evidence: strong evidence in favor of the incorrect hypothesis
  - Analogous to type I and type II errors
- How can we control these?

Misleading Evidence
- Universal bound: under $H_0$, $P(L_1/L_0 \geq k) \leq 1/k$ (Birnbaum, 1962; Smith, 1953)
- In words, the probability that the likelihood ratio exceeds k in favor of the wrong hypothesis can be no larger than 1/k
- In certain cases, an even lower bound applies (Royall, 2000)
  - Difference in normal means
  - Large sample size
- Common choices for k are 8 (strong), 10, and 32 (very strong)
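The universal bound can be checked numerically for a single look at binomial data. The sketch below is illustrative only: the sample size n = 30 and the rates 0.20 and 0.40 are assumed values (the rates are borrowed from the Phase II example later in the talk), and the function names are hypothetical. It sums the exact probability, under $H_0$, of every outcome whose likelihood ratio favors $H_1$ by at least k.

```python
from math import comb

def binom_pmf(x, n, p):
    """Binomial probability of x events in n trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def prob_misleading(n, p0, p1, k):
    """Exact P(L1/L0 >= k) when H0 (rate p0) is true, for one look at n patients."""
    total = 0.0
    for x in range(n + 1):
        lr = binom_pmf(x, n, p1) / binom_pmf(x, n, p0)
        if lr >= k:
            total += binom_pmf(x, n, p0)  # outcome x counts as misleading evidence
    return total

# Assumed illustration: n = 30, H0: p = 0.20, H1: p = 0.40
for k in (8, 10, 32):
    print(f"k={k}: P(misleading)={prob_misleading(30, 0.20, 0.40, k):.4f}, bound={1 / k:.4f}")
```

In every case the computed probability should sit at or below 1/k, which is all the bound promises; how far below depends on n and on the two rates.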

Implications
- Important result: for a sequence of independent observations, the universal bound still holds (Robbins, 1970)
- Implication: we can look at the data as often as desired and our probability of misleading evidence is bounded
- That is, if k = 10, the probability of misleading strong evidence is at most 10%
- k = 10 seems reasonable considering β = 10-20% and α = 5-10% in most studies
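The claim that repeated looks do not break the bound can also be illustrated exactly for binomial data. The following sketch assumes a look after every patient, rates of 0.20 (true) and 0.40 (wrong hypothesis), and a hypothetical function name; it tracks the probability of every path that has not yet crossed the threshold and accumulates the probability of ever reaching LR ≥ k.

```python
from math import log

def prob_ever_misleading(max_n, p0, p1, k):
    """Exact P(L1/L0 >= k at some look n <= max_n) under H0, looking after each patient."""
    step_event = log(p1 / p0)              # change in log-LR when an event occurs
    step_none = log((1 - p1) / (1 - p0))   # change in log-LR when no event occurs
    alive = {0: 1.0}  # events so far -> probability of paths that have not crossed yet
    crossed = 0.0
    for n in range(1, max_n + 1):
        nxt = {}
        for x, pr in alive.items():
            nxt[x + 1] = nxt.get(x + 1, 0.0) + pr * p0
            nxt[x] = nxt.get(x, 0.0) + pr * (1 - p0)
        alive = {}
        for x, pr in nxt.items():
            log_lr = x * step_event + (n - x) * step_none
            if log_lr >= log(k):
                crossed += pr      # first time the threshold is reached
            else:
                alive[x] = pr
    return crossed

# Assumed illustration: up to 200 looks, H0: p = 0.20, H1: p = 0.40, k = 10
print(prob_ever_misleading(200, 0.20, 0.40, 10))
```

Increasing `max_n` can only increase the result, yet it should never exceed 1/k, which is the sense in which the data can be examined as often as desired.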

Sequential Probability Ratio Test
- This is not a new idea
- Abraham Wald (1945) published "Sequential Tests of Statistical Hypotheses" in the Annals of Mathematical Statistics (it's 70 pages long)
- Commonly seen in DSMB plans
  - SPRT used to monitor secondary safety outcomes
- Classical SPRT assumes two boundaries:
  - Stop if there is strong evidence for $H_0$
  - Stop if there is strong evidence for $H_1$
- Generally set up based on A and B, which are based on selected α and β levels

Wald's Figure 2
- The context was not clinical trials
- The initial result was useful for the war effort in 1943 and essentially classified until 1945
- Popular in manufacturing, item response theory, etc.
- Max SPRT (Kulldorff et al., 2011) with applications to medical research, specifically rare adverse events

Stages 1 & 2: Toxicity and Efficacy
Most immunotherapy trials in cancer use a two-step approach to dose finding:
1. Perform an algorithmic design to identify safe doses
2. Collect immunologic data and explore it to see if there appears to be an optimal dose
Optimal dose?
- We imagine there will be a clear plateau in the association between dose and immunoresponse
- Unrealistic and simplistic due to small sample sizes at each dose
- Unrealistic and simplistic due to heterogeneity across patients

Goal
- Develop an adaptive early phase design for assessing toxicity and efficacy outcomes in cancer immunotherapy trials.
- Identify the optimal dose to maximize efficacy while maintaining safety.
Two-stage design:
- Stage 1: Explore doses for safety and obtain information on immunotherapy outcomes
- Stage 2: Allocate patients to allowable doses with emphasis towards doses with higher efficacy
Uses both:
- Continuous (immunologic) outcomes
- Binary toxicity information
Optimize efficacy while setting a threshold on acceptable toxicity

Stage 1: Confirm Safety
- Define $p_1$ and $p_2$ as unacceptable and acceptable DLT rates
- Use cohorts of size m to explore selected dose levels
- Likelihood inference is used to declare dose levels allowable based on $p_1$, $p_2$, and the observed data
- Define k as the threshold of evidence required for declaring a dose to be toxic
- At the end of Stage 1, there will be a set of doses for Stage 2
- Continue to Stage 2 if there are two or more allowable doses
- Details in Chiuzan et al. 2014

Likelihood method with cohorts of size 3
Example: $p_1$ = 0.40; $p_2$ = 0.15; m = 3
Require likelihood ratio ≥ 2 (in favor of $p_1$) to declare toxic.
- 0 or 1 DLT in 3 pts: allowable dose
- 2 or 3 DLTs in 3 pts: unacceptable dose (and all higher doses unacceptable)

DLTs  3+3 rule     L(p1)/L(p2)  k = 1       k = 2       k = 8
0     acceptable   0.35         acceptable  acceptable  weak
1     expand to 6  1.33         toxic       weak        weak
2     toxic        5.0          toxic       toxic       weak
3     toxic        19.0         toxic       toxic       toxic

toxic: LR ≥ k; acceptable: LR ≤ 1/k; weak: 1/k < LR < k
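The likelihood ratios and classifications for a cohort of 3 follow directly from the binomial likelihood, since the binomial coefficient cancels in the ratio. A minimal sketch, with hypothetical function names and the slide's values $p_1$ = 0.40 and $p_2$ = 0.15 as defaults:

```python
def dlt_likelihood_ratio(dlts, n, p_toxic=0.40, p_ok=0.15):
    """L(p_toxic) / L(p_ok) for `dlts` DLTs among n patients (binomial coefficients cancel)."""
    return (p_toxic / p_ok) ** dlts * ((1 - p_toxic) / (1 - p_ok)) ** (n - dlts)

def classify(lr, k):
    """Apply the evidence thresholds: toxic if LR >= k, acceptable if LR <= 1/k, else weak."""
    if lr >= k:
        return "toxic"
    if lr <= 1 / k:
        return "acceptable"
    return "weak"

for dlts in range(4):
    lr = dlt_likelihood_ratio(dlts, 3)
    print(dlts, round(lr, 2), [classify(lr, k) for k in (1, 2, 8)])
```

The same two functions apply to other cohort sizes by changing `n`, which is how the choice of k can be explored before fixing a design.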

Stage 2: Adaptive randomization
- Using data from Stage 1, estimate the immunologic parameter at each of J allowable doses.
- Example: T cell persistence at 14 days.
  - $y_i$ = % CD3 cells of patient i at 14 days compared to baseline
  - $d_i$ = dose level for patient i
- Estimation is based on a standard linear regression model using a log transformation of $y_i$:

$$\log(y_i) = \beta_0 + \sum_{j=1}^{J} \beta_j I(d_i = j) + e_i, \qquad e_i \sim N(0, \sigma^2), \qquad \sum_j \beta_j = 0$$

Stage 2: Adaptive randomization
- Define $\hat{p}_j$ as the estimated persistence (%) at dose j: $\hat{p}_j = e^{\hat{\beta}_0 + \hat{\beta}_j}$
- Calculate the randomization probabilities $\pi_j$ for doses j = 1, ..., J:

$$\pi_j = \frac{\hat{p}_j}{\sum_r \hat{p}_r} \qquad \text{or} \qquad \pi_j = \frac{\sqrt{\hat{p}_j}}{\sum_r \sqrt{\hat{p}_r}}$$
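The step from regression estimates to randomization probabilities can be sketched as follows. The coefficient values are made up for illustration, the function name is hypothetical, and the `power` argument is an assumption added here so that a transformed weighting (for example a square root, `power=0.5`) can be tried alongside the plain proportional rule.

```python
from math import exp

def randomization_probs(p_hat, power=1.0):
    """Normalize estimated persistence values into allocation probabilities.

    power=1.0 gives pi_j = p_j / sum(p_r); other powers are an assumed generalization.
    """
    weights = [p ** power for p in p_hat]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical estimates: intercept and dose effects constrained to sum to zero
beta0 = 3.5
beta = [-0.6, -0.2, 0.3, 0.5]
p_hat = [exp(beta0 + b) for b in beta]   # estimated persistence (%) at each dose

print([round(p, 3) for p in randomization_probs(p_hat)])
print([round(p, 3) for p in randomization_probs(p_hat, power=0.5)])
```

A power below 1 pulls the probabilities toward equal allocation, so fewer patients are withheld from doses that look weaker on limited early data.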

Example 1: shallow slope
[Figure: T cell persistence (% of baseline, 0 to 100) by dose level (1 to 4)]
Randomization probabilities: $\pi_1$ = 0.14, $\pi_2$ = 0.2, $\pi_3$ = 0.33, $\pi_4$ = 0.33

Example 2: steep slope
[Figure: T cell persistence (% of baseline, 0 to 100) by dose level (1 to 4)]
Randomization probabilities: $\pi_1$ = 0.05, $\pi_2$ = 0.13, $\pi_3$ = 0.33, $\pi_4$ = 0.49

Stage 2
- For the first patient in Stage 2, randomize to allowable doses j = 1, ..., J based on $\pi_j$.
- As data become available, update the randomization probabilities for accruing patients.
- Repeat until the total sample size is achieved or some other stopping criterion is met.
- When DLTs are observed, use likelihood inference to determine if the dose is toxic.

Example: cohort of size 5
Example: $p_1$ = 0.40; $p_2$ = 0.15
Require likelihood ratio ≥ 2 (in favor of $p_1$) to declare toxic.
- 0 or 1 DLT in 5 pts: allowable dose
- 2 or more DLTs in 5 pts: unacceptable dose (and all higher doses unacceptable)

DLTs  L(p1)/L(p2)  k = 1       k = 2       k = 8
0     0.18         acceptable  acceptable  weak
1     0.66         acceptable  weak        weak
2     2.50         toxic       toxic       weak
3     9.45         toxic       toxic       toxic

toxic: LR ≥ k; acceptable: LR ≤ 1/k; weak: 1/k < LR < k

Aside: The 3+3 provides weak evidence for DLT rate
$H_0$: $p_0$ = 0.40; $H_1$: $p_1$ = 0.15

DLT (n)        3+3 rule    L(p1)/L(p0)  k = 1       k = 2       k = 8
0(3)           acceptable  2.84         acceptable  acceptable  weak
1(3) + 0(3)    acceptable  2.14         acceptable  acceptable  weak
2(3)           toxic       0.20         toxic       toxic       weak
3(3)           toxic       0.05         toxic       toxic       toxic
1(3) + 1(3)    toxic       0.57         toxic       weak        weak
1(3) + 2(3)    toxic       0.15         toxic       toxic       weak
1(3) + 3(3)    toxic       0.04         toxic       toxic       toxic

DLT (n): dose limiting toxicities (cohort size)
acceptable: LR ≥ k; toxic: LR ≤ 1/k; weak: 1/k < LR < k

Constrained Adaptive Randomization
Simulations:
- Six toxicity and five efficacy models; two variance structures; two sample sizes
- Safety constraints implemented using $H_1$: $p_1$ = 0.40; $H_2$: $p_2$ = 0.15; k = 2
Subset presented:
- Six toxicity scenarios
- Plateau efficacy model
- Large variance across patients
- Smaller sample size (N = 25)

Results: Stages 1 & 2
[Figures: one results panel per toxicity scenario]
True DLT rates by dose level in the six scenarios shown:
1. 0.05, 0.07, 0.10, 0.12, 0.15
2. 0.25, 0.25, 0.25, 0.25, 0.25
3. 0.05, 0.18, 0.29, 0.40, 0.51
4. 0.05, 0.42, 0.48, 0.50, 0.52
5. 0.02, 0.02, 0.02, 0.02, 0.02
6. 0.02, 0.05, 0.28, 0.40, 0.50

Phase II: Sequential Trial Design
Motivating Example
- Single arm cancer clinical trial
- Outcome = clinical response
- Early stopping for futility
- Standard frequentist approach
  - Simon two-stage design
  - Only one look at the data
  - Determine optimality criterion: minimax, or minimum E(N) under $H_0$ (Simon's optimal)
- Likelihood approach
  - Use binomial likelihood
  - Can look at the data after each observation

Phase II: Sequential Trial Design
Motivating Example
- New cancer treatment agent
- Anticipated response rate is 40% ($p_2$ = 0.40)
- Null response rate is 20% ($p_1$ = 0.20)
  - The standard of care yields 20%
  - Not worth pursuing a new treatment with the same response rate as the current treatment
- Using the frequentist approach: Simon two-stage with α = β = 0.10
  - Optimum criterion: smallest E(N)
  - First stage: enroll 17. If 4 or more respond, continue
  - Second stage: enroll 20. If 11 or more respond, conclude success
  - Maximum sample size = 37
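The operating characteristics of this two-stage rule can be computed exactly from binomial probabilities. The sketch below uses hypothetical function names and encodes the slide's rule as "stop after stage 1 with 3 or fewer responses out of 17; declare success with more than 10 responses out of 37".

```python
from math import comb

def pmf(x, n, p):
    """Binomial probability of x responses in n patients."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def simon_two_stage(p, n1=17, r1=3, n2=20, r=10):
    """Return (P(early termination), E(N), P(declare success)) at true response rate p."""
    pet = sum(pmf(x, n1, p) for x in range(r1 + 1))
    expected_n = n1 + n2 * (1 - pet)
    success = 0.0
    for x1 in range(r1 + 1, n1 + 1):            # trial continues to stage 2
        need = max(r + 1 - x1, 0)               # stage-2 responses still required
        success += pmf(x1, n1, p) * sum(pmf(x2, n2, p) for x2 in range(need, n2 + 1))
    return pet, expected_n, success

pet0, en0, alpha = simon_two_stage(0.20)   # under the null response rate
_, _, power = simon_two_stage(0.40)        # under the anticipated response rate
print(round(pet0, 2), round(en0, 1), round(alpha, 3), round(power, 3))
```

Under the null rate of 20% this gives a probability of early termination of about 0.55 and an expected sample size of about 26.0.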

Implementation
- Look at the data after each patient
- Estimate the difference between $\log(L_1)$ and $\log(L_2)$, the binomial log-likelihoods at $p_1$ and $p_2$
- Rules:
  - If $\log(L_1) - \log(L_2) \geq \log(k)$: stop for futility
  - If $\log(L_1) - \log(L_2) < \log(k)$: continue enrolling patients

Stopping Rules
- Given the discrete nature of the data, only certain looks provide an opportunity to stop
- Current example: stop the study if
  - 0 responses in 9 patients
  - 1 response in 12 patients
  - 2 responses in 15 patients
  - 3 responses in 19 patients
  - 4 responses in 22 patients
  - 5 responses in 26 patients
  - 6 responses in 29 patients
  - 7 responses in 32 patients
- Although total N can be as large as 37, there are only 8 thresholds for futility early stopping assessment
[Figure: number of responses (0 to 7) vs. sample size (0 to 36), with regions labeled "Continue Study" and "Stop Study for Futility"]
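These thresholds can be reproduced from the log-likelihood rule. The slide does not state k for this example; the sketch below assumes k = 10 (the value discussed earlier in the talk) with $p_1$ = 0.20 and $p_2$ = 0.40, and the function name is hypothetical. For each response count it finds the first sample size at which the evidence for the null rate reaches k.

```python
from math import log

def futility_boundaries(p_null=0.20, p_alt=0.40, k=10, max_n=37, max_responses=7):
    """For each response count x, the smallest n with L(p_null)/L(p_alt) >= k."""
    bounds = {}
    for x in range(max_responses + 1):
        for n in range(max(x, 1), max_n + 1):
            # binomial log-likelihood ratio; the binomial coefficient cancels
            log_lr = x * log(p_null / p_alt) + (n - x) * log((1 - p_null) / (1 - p_alt))
            if log_lr >= log(k):
                bounds[x] = n
                break
    return bounds

for responses, n in futility_boundaries().items():
    print(f"stop if {responses} responses in {n} patients")
```

With these assumed inputs the function returns the eight looks listed on the slide, which suggests k = 10 is consistent with the example, though the original threshold is not confirmed by the source.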

Design Performance Characteristics
- How does the proposed approach compare to the optimal Simon two-stage design?
- What are the performance characteristics we would be interested in?
  - Small E(N) under the null hypothesis
  - Frequent stopping under the null
  - Infrequent stopping under the alternative
  - Acceptance of $H_1$ under $H_1$
  - Acceptance of $H_2$ under $H_2$

Simon Optimal vs. Likelihood
[Figure: probability (0 to 1) vs. true p (0.1 to 0.6). Curves: Accept HA, Lik; Accept H0, Lik; Weak Evidence; Accept HA, Simon; Accept H0, Simon]

Probability of Early Stopping
[Figure: probability of stopping early (0 to 1) vs. true p (0.1 to 0.6). Curves: Likelihood (optimal N); Likelihood (minmax N); Simon Optimal; Simon MinMax]

Expected Sample Size
[Figure: expected sample size (10 to 35) vs. true p (0.2 to 0.8). Curves: Likelihood (optimal N); Likelihood (minmax N); Simon Optimal; Simon MinMax]

Bayesian Competitor: Predictive Probability
- Lee and Liu, Clinical Trials, 2008
- Bayesian but without a loss function (no Bayes risk)
- Searches for design parameters to ensure size and power
- The prior is chosen so that its mean is the null hypothesis, but it's a weak prior
- Predictive probability (PP) = probability that the end result of the trial is positive given current data and data to be observed
- Based on the probability that the true response rate is greater than the null response rate
- Ignores the alternative

Bayesian Competitor: Predictive Probability
- Stopping: based on pre-defined thresholds, $\theta_L$ and $\theta_U$
  - If PP < $\theta_L$: stop the trial and reject the alternative
  - If PP > $\theta_U$: stop the trial and reject the null (but often $\theta_U$ = 1)
- At pre-specified times, the predictive probability is calculated
- Lee and Liu explore different frequencies of looks
- Comparisons here are for looking after every patient
- $\theta_T$ is defined as the threshold for determining efficacy at the trial's end
- $\theta_T$ and $\theta_U$ do not have the same stringency

Comparison of Designs

Design                  PET_1  E(N_1)
Simon Two-Stage         0.55   26.0
Predictive Probability  0.85   25.1
Likelihood              0.80   22.1

Monitoring Safety
- Can be used to monitor safety, e.g. transplant-related mortality
- It is set up so that too many events trigger stopping (instead of too few)
- More appropriate than:
  - Repeated confidence intervals
  - Repeated p-values (controlling alpha?)
- Like a one-sided SPRT

- Evidential methods have been around a long time
- They allow us to:
  - Simply quantify evidence for competing hypotheses
  - Evaluate the data frequently with bounded probability of misleading evidence
- Likelihood methods are simple to implement and require no prior
- Sufficient work has been done to justify K thresholds for Phase III trials
- More work should be done to provide reasonable guidance on appropriate K in early phase trials

Thank you! Email: garrettm@musc.edu; chiuzan@musc.edu Work supported by the NIH grant, P01 CA 154778-01 and by the Biostatistics Shared Resource, Hollings Cancer Center, MUSC (P30 CA138313)