Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups

Save this PDF as:

Size: px
Start display at page:

Download "Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln. Log-Rank Test for More Than Two Groups"

Transcription

1 Survey, Statistics and Psychometrics Core Research Facility University of Nebraska-Lincoln Log-Rank Test for More Than Two Groups Prepared by Harlan Sayles (SRAM) Revised by Julia Soulakova (Statistics) The Log-Rank test for more than two groups allows for comparing rates of events of interest (e.g., death, remission from cancer, machine part failure) between more than two distinct groups. A population of interest may be cancer patients who undergo certain cancer treatments, a batch of machine parts, or criminals recently released from prison. The corresponding events of interest may be death, part failure, or recidivism (arrested for a new crime) in each of these respective cases. The groups within these populations may be formed from the race of the patient for the persons who have undergone cancer treatments, the aluminum alloy composition of the different machine parts to be tested, and the socioeconomic status of the recently released prisoners. The Log-Rank test can be used to answer questions similar to the following: 1. Are death rates for persons who have undergone an experimental treatment for colon cancer different based on a person s race (e.g., compare Blacks, Whites, Hispanics, and others)? 2. Does the rate or failure for machine parts differ based upon the composition of the aluminum alloy that they are made from (e.g., compare parts that are identical in size and function, but made from different metals)? 3. Do recently released criminals with low socioeconomic status tend to have higher recidivism rates than criminals with medium or high socioeconomic status? Data Considerations The Log-Rank test allows for two types of non-perfect survival data, lefttruncated data and right-censored data. Ideally, all subjects would be observed starting at time 0 and each subject would experience the event of interest prior to the conclusion of the study period. For practical reasons, however, this is not always possible. For left-truncated data, the event of interest happens prior to the beginning of the study period and because of this, the subject is not included in the study. For example, if a patient dies of colon cancer prior to receiving the experimental treatment, he can not be included in the study and we say that our data are left-truncated. For right-censored data,

2 the subject s event of interest does not occur during the study period or before it, but the subject is at least observed for the duration of the study period, so some information is known about the subject s survival. For the example of the machine parts, if the time allowed for a researcher to complete the study is limited to six months, and at the end of six months some parts have not yet failed, then the researcher must end the test and accept that the failure time for the remaining parts is greater than six months (assuming that all parts will eventually fail). The time data corresponding to the parts that do not fail within six months are said to be right-censored because it is known that the failure time is greater than six months, but exactly how much greater is not known. Several variables are required in order to perform a Log-Rank test for a comparison between more than two groups. First, the dataset must contain an event indicator variable as it is necessary in all survival analysis experiments. For ease of use, the indicator should be coded as a 1 if the event happens within the observation period, or a 0 if the event does not happen within the observation period, i.e., the case is rightcensored. The data must also include a variable representing the amount of time on study. This variable should correspond to the time from the beginning of the observational period until either the event of interest (event indicator = 1) or censoring (event indicator = 0). Finally, the dataset must include a group indicator variable to indicate which comparison group each subject belongs to. The group indicator variable may be either numeric (e.g. 1, 2, 3, 4) or a string (e.g. Black, White, Hispanic, Other ). Alternatively, if the group variable is continuous and needs to be converted to a discrete variable, this can be done with the LIFETEST procedure in SAS by specifying a list of intervals following the variable name in the SAS syntax (see the SAS Help and Documentation for details). The test statistics for the Log-Rank test are based on large-sample approximations. Larger sample sizes will lead to more reliable results, as is the case with many statistical procedures. The number of comparison groups should not be allowed to get too large to avoid having comparison groups with too few subjects. Each group should contain at least 30 subjects (and preferably more) for best results. Main Ideas The basic concept behind the Log-Rank test is to allow for a test of the null hypothesis that the hazard functions for all groups are equal for all study time versus the alternative hypothesis that at least one hazard function is different from the others for some time during the study. To put it into mathematical terms for k groups H 0 : h 1 (t) = h 2 (t) = = h k (t) for all t τ H A : at least one of the h j (t) s is different for some t τ where τ is the largest time during which each group has at least one subject at risk (at least one subject in each group has not experienced the event of interest or been censored). The Log-Rank test will compare the hazards from each group at each event time between 0 and τ. Each event time is defined as each unique time when any subject from

3 any group experiences the event of interest, regardless of whether any other subject in any of the other groups experiences the event or not. The total number of unique event times in the population is represented by D. If all hazards are equal for all groups, then it is assumed that the proportion of each group experiencing the event at any given event time t i will be equal to the proportion of the overall population experiencing the event at that same time. If d ij is the number of events experienced by group j at event time t i, Y ij is the number of persons at risk in group j just prior to time t i, d j is the total number of events experienced by the entire study population at event time t i, and Y i is the total number of persons at risk in the entire study population just prior to time t i, then it is assumed that d ij / Y ij = d i / Y i for all event times t i, i = 1, 2,, D, j = 1, 2,, k The Log-Rank test begins by calculating a statistic representing the sum of the weighted differences between d ij / Y ij and d i / Y i at each event time t i for each group j = 1 through k. For the Log-Rank test, the weights applied to these differences are all equal to 1, so each event time has an equal weighting on the value of the statistics. Variances and covariances for each of these k statistics are also calculated using slightly more advanced formulas (see Klein and Moeschberger, 2003, pg 207). The statistics calculated for the k groups are linearly dependent and therefore, we may use only k-1 of them to calculate a test statistic. To calculate the test statistic, k-1 of the statistics are formed into a vector called Z. The variances and covariances for these k-1 statistics are placed into a variance-covariance matrix called Σ (again, see Klein and Moeschberger, 2003, pg 207 for details). A test statistic is then calculated by equation 1 χ 2 = Z(Σ -1 )Z t (1) which has a chi-squared distribution with k-1 degrees of freedom when the null hypothesis is true. An Example The data for this example are empirical. They are for illustration purposes only! The data set is very small to allow for simplicity of illustration, and consists of three variables representing group membership (Race), time in the study in months (Months), and an event indicator (Status) that equals 1 if the person has experienced the event and 0 if the case is right-censored. Assume that this data represents a group of 10 persons who have undergone an experimental treatment for colon cancer. The data are shown in Table 1 below. The SAS code below can be used to create the data set and perform the analyses. The Data command simply creates the data set. The Proc Sort command is necessary to sort the data set by the grouping variable prior to running the Proc Lifetest command. SAS will expect the data to be sorted by the grouping variable and may not function properly if the data are not sorted.

4 Table 1. Fictitious Colon Cancer Patient Data Race Months Status Black 9 1 Black 15 1 Other 17 1 Black 20 1 Other 24 1 Hispanic 25 1 Hispanic 31 1 White 40 1 Hispanic 48 0 White 48 0 The Proc Lifetest command uses the Product-Limit estimator, by default, to estimate the survival curves (Figure 1). Unfortunately, this method does not allow for a graph of the hazard functions for each group to be printed directly, but it does give output for the Log-Rank test. The Log-Rank test output is requested as an option in the strata statement with the syntax test=(logrank). Incidentally, the Log-Rank test is the default test output for the strata statement, along with a couple of other tests, so the results of the Logrank test would be included in the output even if the test=(logrank) option were left out of the syntax, but by including this option, only the Log-Rank results are given. SAS output for the Lifetest Procedure is presented after the figure. SAS Syntax Data Cancer; input Race \$ Months Status; datalines; Black 3 1 Black 9 1 Other 19 1 Black 39 1 Hispanic 41 1 Other 45 1 Hispanic 45 1 White 46 1 Hispanic 48 0 White 48 0 run; Proc Sort data=work.cancer; By Race; Run; Proc Lifetest data=work.cancer method=pl plots=(s(name=survival)) graphics; time Months*Status(0); strata Race / test=(logrank); symbol1 v=none color=black line=1; symbol2 v=none color=red line=3; symbol3 v=none color=blue line=27; symbol4 v=none color=green line=41; run;

5 Figure 1. Survival Graphs for method=pl SAS Output The LIFETEST Procedure Stratum 1: Race = Black Product-Limit Survival Estimates Survival Standard Number Number Months Survival Failure Error Failed Left Summary Statistics for Time Variable Months Quartile Estimates Point 95% Confidence Interval Percent Estimate [Lower Upper) Mean Standard Error

6 Stratum 2: Race = Hispanic Product-Limit Survival Estimates Survival Standard Number Number Months Survival Failure Error Failed Left * NOTE: The marked survival times are censored observations. Summary Statistics for Time Variable Months Quartile Estimates Point 95% Confidence Interval Percent Estimate [Lower Upper) Mean Standard Error NOTE: The mean survival time and its standard error were underestimated because the largest observation was censored and the estimation was restricted to the largest event time. Stratum 3: Race = Other Product-Limit Survival Estimates Survival Standard Number Number Months Survival Failure Error Failed Left Summary Statistics for Time Variable Months Quartile Estimates Point 95% Confidence Interval Percent Estimate [Lower Upper) Mean Standard Error

7 Stratum 4: Race = White Product-Limit Survival Estimates Survival Standard Number Number Months Survival Failure Error Failed Left * NOTE: The marked survival times are censored observations. Summary Statistics for Time Variable Months Quartile Estimates Point 95% Confidence Interval Percent Estimate [Lower Upper) Mean Standard Error NOTE: The mean survival time and its standard error were underestimated because the largest observation was censored and the estimation was restricted to the largest event time. Summary of the Number of Censored and Uncensored Values Percent Stratum Race Total Failed Censored Censored 1 Black Hispanic Other White Total Testing Homogeneity of Survival Curves for Months over Strata Rank Statistics Race Log-Rank Black Hispanic Other White

8 Covariance Matrix for the Log-Rank Statistics Race Black Hispanic Other White Black Hispanic Other White Test of Equality over Strata Pr > Test Chi-Square DF Chi-Square Log-Rank The output shows results for each value (stratum) of the group identification variable, which is Race in this example. The values listed under Months for each stratum indicate event times or censoring times for cases within that stratum. If a time is a censoring time, it is marked with an asterisk. The next column labeled Survival is the Product-Limit Estimator of the survival function at that time point. The third column, labeled Failure gives the percent of cases in the stratum that have failed (experienced the event) up to that time point. Failure is equal to 1-Survival. The fourth column is the standard error of the Product-Limit survival estimation. The fifth and sixth columns are the total number in the stratum that have failed and the total number in the stratum that have not failed or been censored up to a given time point, respectively. The next part of the output for each stratum calculates estimates of survival percentiles. These numbers for the 25 th, 50 th, and 75 th percentiles are the estimates at which 25%, 50%, and 75% of the subjects in the stratum have NOT survived. The standard errors for these estimates are also presented. These estimates are fairly meaningless with such a small sample size, but can be somewhat useful with a more appropriately sized group. The last statistic reported for each stratum is the mean survival time; that is the average amount of time that a subject in a given stratum is expected to survive. The standard error for this estimate is also given. This number is usually different from the 50% percentile, which is a median survival time as opposed to a mean survival time. The notes under stratums 2 and 4 (Hispanic and White) indicate the statistical problem of trying to calculate an average when not all of the values are known (i.e. the highest one is censored). The estimates of mean survival time are calculated using all of the known values, but they are certain to be underestimates, because the censored survival times must be larger than the largest known survival times and when these larger values are added into the calculation, the calculated mean will necessarily increase. Following the individual strata statistics is a summary table that presents the total number of subjects, failures, censored cases, and the percent of censored cases for each stratum. This information is redundant with information presented above for each individual stratum and is given only for convenience. The final part of the output labeled Testing Homogeneity of Survival Curves for Months over Strata contains all of the information necessary for performing the Log- Rank test and also the result of the test. The first part, labeled Rank Statistics are the

9 sum of the weighted differences between the observed hazards d ij / Y ij for each of the four racial groups and the population hazard d i / Y i at each distinct event time. The second part is the variance-covariance matrix for the rank statistics, calculated using the equations on page 207 of Klein and Moeschberger (2003). Finally, using equation 1 from above, the test statistic for the Log-Rank test is calculated to be , which is distributed as a chisquared random variable with 3 degrees of freedom. The p-value for this statistic is , which is less than the standard alpha cutoff value of 0.5, so the test has rejected the null hypothesis that all strata hazards are equal for all t τ. The graph in Figure 1 shows that there is some evidence of differences between the survival functions across the four strata; early on, the survival for Blacks is much lower than the one for other three groups. Unfortunately, the SAS Lifetest Procedure with the Product-Limit estimator does not allow for a plotting of the hazard curves for each stratum. Such a graph would likely not be very useful anyways, as the hazard function will be equal to zero for all strata at times other than the D distinct event times. The Log-Rank test provides evidence that at least one of the four stratum hazard plots is significantly different from the others for some value of t τ, meaning that there is an effect of race on subject survival time. Hence, there is a significant difference among the survival rates of White, Black, Hispanic and others.

Gamma Distribution Fitting

Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

Survival Analysis Using SPSS. By Hui Bian Office for Faculty Excellence

Survival Analysis Using SPSS By Hui Bian Office for Faculty Excellence Survival analysis What is survival analysis Event history analysis Time series analysis When use survival analysis Research interest

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

Life Table Analysis using Weighted Survey Data

Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using

Comparison of Survival Curves

Comparison of Survival Curves We spent the last class looking at some nonparametric approaches for estimating the survival function, Ŝ(t), over time for a single sample of individuals. Now we want to compare

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

MINITAB ASSISTANT WHITE PAPER

MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

Linda Staub & Alexandros Gekenidis

Seminar in Statistics: Survival Analysis Chapter 2 Kaplan-Meier Survival Curves and the Log- Rank Test Linda Staub & Alexandros Gekenidis March 7th, 2011 1 Review Outcome variable of interest: time until

Applied Statistics. J. Blanchet and J. Wadsworth. Institute of Mathematics, Analysis, and Applications EPF Lausanne

Applied Statistics J. Blanchet and J. Wadsworth Institute of Mathematics, Analysis, and Applications EPF Lausanne An MSc Course for Applied Mathematicians, Fall 2012 Outline 1 Model Comparison 2 Model

Tests for Two Survival Curves Using Cox s Proportional Hazards Model

Chapter 730 Tests for Two Survival Curves Using Cox s Proportional Hazards Model Introduction A clinical trial is often employed to test the equality of survival distributions of two treatment groups.

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

Recall this chart that showed how most of our course would be organized:

Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

1 One-Way ANOVA using SPSS 11.0 This section covers steps for testing the difference between three or more group means using the SPSS ANOVA procedures found in the Compare Means analyses. Specifically,

This can dilute the significance of a departure from the null hypothesis. We can focus the test on departures of a particular form.

One-Degree-of-Freedom Tests Test for group occasion interactions has (number of groups 1) number of occasions 1) degrees of freedom. This can dilute the significance of a departure from the null hypothesis.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

Survival Analysis of Left Truncated Income Protection Insurance Data. [March 29, 2012]

Survival Analysis of Left Truncated Income Protection Insurance Data [March 29, 2012] 1 Qing Liu 2 David Pitt 3 Yan Wang 4 Xueyuan Wu Abstract One of the main characteristics of Income Protection Insurance

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different

Chi-square test Fisher s Exact test

Lesson 1 Chi-square test Fisher s Exact test McNemar s Test Lesson 1 Overview Lesson 11 covered two inference methods for categorical data from groups Confidence Intervals for the difference of two proportions

Calculating P-Values. Parkland College. Isela Guerra Parkland College. Recommended Citation

Parkland College A with Honors Projects Honors Program 2014 Calculating P-Values Isela Guerra Parkland College Recommended Citation Guerra, Isela, "Calculating P-Values" (2014). A with Honors Projects.

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA) As with other parametric statistics, we begin the one-way ANOVA with a test of the underlying assumptions. Our first assumption is the assumption of

Developing Business Failure Prediction Models Using SAS Software Oki Kim, Statistical Analytics

Paper SD-004 Developing Business Failure Prediction Models Using SAS Software Oki Kim, Statistical Analytics ABSTRACT The credit crisis of 2008 has changed the climate in the investment and finance industry.

Simple Linear Regression Inference

Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

Use of deviance statistics for comparing models

A likelihood-ratio test can be used under full ML. The use of such a test is a quite general principle for statistical testing. In hierarchical linear models, the deviance test is mostly used for multiparameter

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives

Chapter 7. One-way ANOVA

Chapter 7 One-way ANOVA One-way ANOVA examines equality of population means for a quantitative outcome and a single categorical explanatory variable with any number of levels. The t-test of Chapter 6 looks

Lecture 2 ESTIMATING THE SURVIVAL FUNCTION. One-sample nonparametric methods

Lecture 2 ESTIMATING THE SURVIVAL FUNCTION One-sample nonparametric methods There are commonly three methods for estimating a survivorship function S(t) = P (T > t) without resorting to parametric models:

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

Statistics in Retail Finance. Chapter 6: Behavioural models

Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural

Consider a study in which. How many subjects? The importance of sample size calculations. An insignificant effect: two possibilities.

Consider a study in which How many subjects? The importance of sample size calculations Office of Research Protections Brown Bag Series KB Boomer, Ph.D. Director, boomer@stat.psu.edu A researcher conducts

Chapter Additional: Standard Deviation and Chi- Square

Chapter Additional: Standard Deviation and Chi- Square Chapter Outline: 6.4 Confidence Intervals for the Standard Deviation 7.5 Hypothesis testing for Standard Deviation Section 6.4 Objectives Interpret

Unit 29 Chi-Square Goodness-of-Fit Test

Unit 29 Chi-Square Goodness-of-Fit Test Objectives: To perform the chi-square hypothesis test concerning proportions corresponding to more than two categories of a qualitative variable To perform the Bonferroni

INTRODUCTORY STATISTICS

INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore

We extended the additive model in two variables to the interaction model by adding a third term to the equation.

Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic

Descriptive Statistics

Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize

Statistical Analysis The First Steps Jennifer L. Waller Medical College of Georgia, Augusta, Georgia

Statistical Analysis The First Steps Jennifer L. Waller Medical College of Georgia, Augusta, Georgia ABSTRACT For both statisticians and non-statisticians, knowing what data look like before more rigorous

Multiple Linear Regression

Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

Section 12 Part 2. Chi-square test

Section 12 Part 2 Chi-square test McNemar s Test Section 12 Part 2 Overview Section 12, Part 1 covered two inference methods for categorical data from 2 groups Confidence Intervals for the difference of

Multivariate Analysis of Variance (MANOVA): I. Theory

Gregory Carey, 1998 MANOVA: I - 1 Multivariate Analysis of Variance (MANOVA): I. Theory Introduction The purpose of a t test is to assess the likelihood that the means for two groups are sampled from the

Multivariate Normal Distribution

Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

Profile analysis is the multivariate equivalent of repeated measures or mixed ANOVA. Profile analysis is most commonly used in two cases:

Profile Analysis Introduction Profile analysis is the multivariate equivalent of repeated measures or mixed ANOVA. Profile analysis is most commonly used in two cases: ) Comparing the same dependent variables

Scatter Plots with Error Bars

Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses

Using Repeated Measures Techniques To Analyze Cluster-correlated Survey Responses G. Gordon Brown, Celia R. Eicheldinger, and James R. Chromy RTI International, Research Triangle Park, NC 27709 Abstract

1 SAMPLE SIGN TEST. Non-Parametric Univariate Tests: 1 Sample Sign Test 1. A non-parametric equivalent of the 1 SAMPLE T-TEST.

Non-Parametric Univariate Tests: 1 Sample Sign Test 1 1 SAMPLE SIGN TEST A non-parametric equivalent of the 1 SAMPLE T-TEST. ASSUMPTIONS: Data is non-normally distributed, even after log transforming.

Factors affecting online sales

Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

Introduction to Stata

Introduction to Stata September 23, 2014 Stata is one of a few statistical analysis programs that social scientists use. Stata is in the mid-range of how easy it is to use. Other options include SPSS,

Estimating survival functions has interested statisticians for numerous years.

ZHAO, GUOLIN, M.A. Nonparametric and Parametric Survival Analysis of Censored Data with Possible Violation of Method Assumptions. (2008) Directed by Dr. Kirsten Doehler. 55pp. Estimating survival functions

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

2 Right Censoring and Kaplan-Meier Estimator

2 Right Censoring and Kaplan-Meier Estimator In biomedical applications, especially in clinical trials, two important issues arise when studying time to event data (we will assume the event to be death.

PASS Sample Size Software

Chapter 250 Introduction The Chi-square test is often used to test whether sets of frequencies or proportions follow certain patterns. The two most common instances are tests of goodness of fit using multinomial

Main Effects and Interactions

Main Effects & Interactions page 1 Main Effects and Interactions So far, we ve talked about studies in which there is just one independent variable, such as violence of television program. You might randomly

Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures

Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone:

An Application of Weibull Analysis to Determine Failure Rates in Automotive Components

An Application of Weibull Analysis to Determine Failure Rates in Automotive Components Jingshu Wu, PhD, PE, Stephen McHenry, Jeffrey Quandt National Highway Traffic Safety Administration (NHTSA) U.S. Department

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

Using R for Linear Regression

Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@

Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,

NCSS Statistical Software

Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

Chapter 6: The Information Function 129. CHAPTER 7 Test Calibration

Chapter 6: The Information Function 129 CHAPTER 7 Test Calibration 130 Chapter 7: Test Calibration CHAPTER 7 Test Calibration For didactic purposes, all of the preceding chapters have assumed that the

NCSS Statistical Software

Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

13: Additional ANOVA Topics. Post hoc Comparisons

13: Additional ANOVA Topics Post hoc Comparisons ANOVA Assumptions Assessing Group Variances When Distributional Assumptions are Severely Violated Kruskal-Wallis Test Post hoc Comparisons In the prior

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541

Using An Ordered Logistic Regression Model with SAS Vartanian: SW 541 libname in1 >c:\=; Data first; Set in1.extract; A=1; PROC LOGIST OUTEST=DD MAXITER=100 ORDER=DATA; OUTPUT OUT=CC XBETA=XB P=PROB; MODEL

An analysis method for a quantitative outcome and two categorical explanatory variables.

Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that

Microsoft Excel. Qi Wei

Microsoft Excel Qi Wei Excel (Microsoft Office Excel) is a spreadsheet application written and distributed by Microsoft for Microsoft Windows and Mac OS X. It features calculation, graphing tools, pivot

9-3.4 Likelihood ratio test. Neyman-Pearson lemma

9-3.4 Likelihood ratio test Neyman-Pearson lemma 9-1 Hypothesis Testing 9-1.1 Statistical Hypotheses Statistical hypothesis testing and confidence interval estimation of parameters are the fundamental

CHAPTER 11 CHI-SQUARE: NON-PARAMETRIC COMPARISONS OF FREQUENCY

CHAPTER 11 CHI-SQUARE: NON-PARAMETRIC COMPARISONS OF FREQUENCY The hypothesis testing statistics detailed thus far in this text have all been designed to allow comparison of the means of two or more samples

Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

Understanding Confidence Intervals and Hypothesis Testing Using Excel Data Table Simulation

Understanding Confidence Intervals and Hypothesis Testing Using Excel Data Table Simulation Leslie Chandrakantha lchandra@jjay.cuny.edu Department of Mathematics & Computer Science John Jay College of

AP Statistics 2010 Scoring Guidelines

AP Statistics 2010 Scoring Guidelines The College Board The College Board is a not-for-profit membership association whose mission is to connect students to college success and opportunity. Founded in

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................

Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

Basic Statistical and Modeling Procedures Using SAS

Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217

Part 3 Comparing Groups Chapter 7 Comparing Paired Groups 189 Chapter 8 Comparing Two Independent Groups 217 Chapter 9 Comparing More Than Two Groups 257 188 Elementary Statistics Using SAS Chapter 7 Comparing

Data Analysis. Lecture Empirical Model Building and Methods (Empirische Modellbildung und Methoden) SS Analysis of Experiments - Introduction

Data Analysis Lecture Empirical Model Building and Methods (Empirische Modellbildung und Methoden) Prof. Dr. Dr. h.c. Dieter Rombach Dr. Andreas Jedlitschka SS 2014 Analysis of Experiments - Introduction

SIMPLE LINEAR CORRELATION. r can range from -1 to 1, and is independent of units of measurement. Correlation can be done on two dependent variables.

SIMPLE LINEAR CORRELATION Simple linear correlation is a measure of the degree to which two variables vary together, or a measure of the intensity of the association between two variables. Correlation

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

Distribution (Weibull) Fitting

Chapter 550 Distribution (Weibull) Fitting Introduction This procedure estimates the parameters of the exponential, extreme value, logistic, log-logistic, lognormal, normal, and Weibull probability distributions

New SAS Procedures for Analysis of Sample Survey Data

New SAS Procedures for Analysis of Sample Survey Data Anthony An and Donna Watts, SAS Institute Inc, Cary, NC Abstract Researchers use sample surveys to obtain information on a wide variety of issues Many

Efficacy analysis and graphical representation in Oncology trials - A case study

Efficacy analysis and graphical representation in Oncology trials - A case study Anindita Bhattacharjee Vijayalakshmi Indana Cytel, Pune The views expressed in this presentation are our own and do not

Dongfeng Li. Autumn 2010

Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

General Method: Difference of Means. 3. Calculate df: either Welch-Satterthwaite formula or simpler df = min(n 1, n 2 ) 1.

General Method: Difference of Means 1. Calculate x 1, x 2, SE 1, SE 2. 2. Combined SE = SE1 2 + SE2 2. ASSUMES INDEPENDENT SAMPLES. 3. Calculate df: either Welch-Satterthwaite formula or simpler df = min(n

Statistical Models in R

Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

Lesson 1: Comparison of Population Means Part c: Comparison of Two- Means

Lesson : Comparison of Population Means Part c: Comparison of Two- Means Welcome to lesson c. This third lesson of lesson will discuss hypothesis testing for two independent means. Steps in Hypothesis

STATA Tutorial Professor Erdinç Please follow the directions once you locate the Stata software in your computer. Room 114 (Business Lab) has computers with Stata software 1.Wald Test Wald Test is used

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

Multivariate Logistic Regression

1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

Chapter 6: t test for dependent samples

Chapter 6: t test for dependent samples ****This chapter corresponds to chapter 11 of your book ( t(ea) for Two (Again) ). What it is: The t test for dependent samples is used to determine whether the

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More details on the inputs, functionality, and output can be found below.

Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing

Generalized Linear Models

Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

Statistics for Clinical Trial SAS Programmers 1: paired t-test Kevin Lee, Covance Inc., Conshohocken, PA

Statistics for Clinical Trial SAS Programmers 1: paired t-test Kevin Lee, Covance Inc., Conshohocken, PA ABSTRACT This paper is intended for SAS programmers who are interested in understanding common statistical