Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

1 Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma The Research Unit for General Practice in Copenhagen Dias 1

2 Content
Quantifying association between continuous variables. In particular:
- Correlation
- (Simple) regression

3 Example: newly diagnosed Type 2 diabetes
A data set with 729 newly diagnosed Type 2 diabetes patients. Variables:
pt: patient ID
glucose: diagnostic plasma glucose (mmol/l)
bmi: Body Mass Index (kg/m2)
sex: sex (1=male, 0=female)
age: age (years)

4 Research question
Do fat people have more severe diabetes when the diabetes is discovered?
Or, in more statistical language: is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis?

5 Scatter-plot
When investigating a potential association between only two variables (like diagnostic plasma glucose and BMI), a scatter-plot is an important part of the analysis.
- It gives insight into the nature of the association.
- It shows problems in the data, e.g. outliers and strange or impossible values.

6 Scatter-plot
[Figure: scatter-plot of diagnostic plasma glucose against BMI.]

7 Scatter-plot
There is no apparent tendency, and specifically not one that would support our research question. If we have to point out a tendency, it would be that high BMI is associated with lower diagnostic glucose (why is this not so strange if we think about how diabetes is diagnosed?).
There seem to be some very large values, especially for diagnostic plasma glucose. These are valid measurements.
Maybe a log transformation of glucose would make associations more apparent?

8 Scatter-plot: R code
    plot(diabetes$bmi, diabetes$glucose, frame.plot=TRUE, main=NULL,
         xlab="BMI (kg/m2)", ylab="Glucose (mmol/l)", col="green", pch=19)

9 Scatter-plot: log transformation
[Figure: the scatter-plot with log-transformed glucose.]

10 Measures of association
We want to capture the association between two variables in a single number: a correlation coefficient, a measure of association.
Suppose that $Y_i$ is the diagnostic plasma glucose of patient i and $X_i$ the BMI of the same patient. Then we want our measure of association to have the following characteristics:
- A positive association indicates that if $X_i$ is large (relative to the rest of the sample) then $Y_i$ is likely to be large as well.
- A negative association indicates that if $X_i$ is large then $Y_i$ is likely to be small.

11 Measures of association
A measure of association takes values between -1 and 1:
 0: no association
 1: perfect positive association
-1: perfect negative association

12 Measures of association for the diabetes data
r = , ρ = , τ =

13 Measures of association for the diabetes data, with glucose log transformed
r = , ρ = , τ =
Only the first one changes!
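The observation that only the first coefficient changes under a log transformation can be checked on any data set. The sketch below uses simulated data as a stand-in, since the course data set is not reproduced here; the variable names, the simulation parameters and the helper `measures()` are assumptions made for illustration only.

```r
# Simulated stand-in data (NOT the course data): right-skewed, strictly positive "glucose"
set.seed(2)
n <- 300
bmi <- rnorm(n, mean = 30, sd = 5)
glucose <- exp(2.6 - 0.005 * bmi + rnorm(n, sd = 0.4))

# Hypothetical helper: all three measures of association in one vector
measures <- function(x, y) {
  c(pearson  = cor(x, y, method = "pearson"),
    spearman = cor(x, y, method = "spearman"),
    kendall  = cor(x, y, method = "kendall"))
}

raw    <- measures(bmi, glucose)
logged <- measures(bmi, log(glucose))

# The rank-based rows agree; the Pearson column differs
print(rbind(raw, logged))
```

The log is a strictly increasing function, so it leaves the ranks of the glucose values untouched; the two rank-based coefficients depend on the data only through those ranks.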

14 Pearson's correlation coefficient
Pearson's correlation coefficient is computed from the data set $(X_i, Y_i)$, $i = 1, \ldots, N$ as:
$$ r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{(N-1)\, SD_x\, SD_y} $$
where $\bar{X}$ and $\bar{Y}$ are the respective means and $SD_x$ and $SD_y$ the respective standard deviations.
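As a rough check of the formula, it can be written out term by term and compared with R's built-in `cor()`. The data below are simulated, and the function name `pearson_by_hand` is a hypothetical helper, not part of the course material.

```r
# Simulated stand-in data (NOT the course data)
set.seed(1)
n <- 200
bmi <- rnorm(n, mean = 30, sd = 5)
glucose <- 14 - 0.05 * bmi + rnorm(n, sd = 4)

# Pearson's r: sum of cross-products of deviations over (N - 1) * SD_x * SD_y
pearson_by_hand <- function(x, y) {
  N <- length(x)
  sum((x - mean(x)) * (y - mean(y))) / ((N - 1) * sd(x) * sd(y))
}

r_hand    <- pearson_by_hand(bmi, glucose)
r_builtin <- cor(bmi, glucose)
print(c(by_hand = r_hand, builtin = r_builtin))
```

Note that `sd()` in R divides by N - 1, which is why the (N - 1) factor in the denominator makes the two results agree.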

15 Characteristics of Pearson's correlation coefficient
Pearson's correlation coefficient has the following properties:
- It measures the degree of linear association.
- It is invariant to linear change of scale for the variables.
- It is not robust to outliers.
- Coefficient values that are comparable between different data sets, and moreover a valid confidence interval and p-value, require that both $X_i$ and $Y_i$ are Normally distributed.

16 Pearson's correlation coefficient: R code
    > cor(diabetes$bmi, diabetes$glucose, use="complete.obs")
    [1]
Gives only the correlation coefficient.
    > cor.test(diabetes$bmi, diabetes$glucose)
        Pearson's product-moment correlation
    data:  diabetes$bmi and diabetes$glucose
    t = , df = 723, p-value =
    alternative hypothesis: true correlation is not equal to 0
    95 percent confidence interval:
    sample estimates:
    cor
Also performs a statistical test to see whether the coefficient is different from zero.

17 Normally distributed?
[Figure panels: BMI; Glucose. A Normal distribution is shown for comparison.]

18 Normally distributed?
[Figure panels: BMI; Log(Glucose).]

19 Normally distributed?

20 Normally distributed?

21 R code
A histogram of BMI:
    hist(diabetes$bmi, main="BMI", xlab="BMI (kg/m2)", col="green")
A Normal Q-Q plot of BMI:
    qqnorm(diabetes$bmi, main="BMI", col="green")
    qqline(diabetes$bmi, col="red")
And how do we get all these works of art in some decent format?
    jpeg(filename="D:/mydirectory/mypicture.jpg", width=500, height=500)
    #
    # put here the code that generates the picture
    #
    dev.off()

22 Rank correlation: Spearman's ρ
If the data do not appear to be Normally distributed, or when there are outliers, one may instead compute the correlation between the ranks of the $X_i$ values and the ranks of the $Y_i$ values. This gives a nonparametric correlation coefficient called Spearman's ρ.
- It measures monotone association.
- It is invariant to monotone transformations (like a log transformation).
- It is robust to outliers.
- It has an odd interpretation.
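The definition above, a correlation between ranks, can be sketched directly: Pearson's coefficient applied to `rank(x)` and `rank(y)` should reproduce what `cor()` reports with `method = "spearman"`. The data are simulated and purely illustrative.

```r
# Simulated, skewed data with a monotone but non-linear relationship (illustrative only)
set.seed(3)
x <- rexp(100)
y <- x^2 + rnorm(100, sd = 0.5)

# Spearman's rho as Pearson's correlation of the ranks
rho_ranks   <- cor(rank(x), rank(y))
rho_builtin <- cor(x, y, method = "spearman")
print(c(via_ranks = rho_ranks, builtin = rho_builtin))
```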

23 Spearman's rank correlation coefficient: R code
    > cor.test(diabetes$bmi, diabetes$glucose, method="spearman")
        Spearman's rank correlation rho
    data:  diabetes$bmi and diabetes$glucose
    S = , p-value =
    alternative hypothesis: true rho is not equal to 0
    sample estimates:
    rho
    Warning message:
    In cor.test.default(diabetes$bmi, diabetes$glucose, method = "spearman") :
      Cannot compute exact p-values with ties

24 Rank correlation: Kendall's τ
A measure of monotone association with a more intuitive interpretation than Spearman's ρ is Kendall's τ.
The observations from a pair of subjects i, j are
- concordant if $X_i < X_j$ and $Y_i < Y_j$, or $X_i > X_j$ and $Y_i > Y_j$;
- discordant if $X_i < X_j$ and $Y_i > Y_j$, or $X_i > X_j$ and $Y_i < Y_j$.
Kendall's τ is the difference between the probability of a concordant pair and the probability of a discordant pair.
There are various versions of Kendall's τ depending on how ties are treated.
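The pair-counting definition can be sketched as a double loop over all pairs of subjects. The function `kendall_by_pairs` is a hypothetical helper written for this illustration; it implements the simplest version (no correction for ties), which coincides with R's `cor(..., method = "kendall")` only when the data contain no ties, as is the case for the simulated continuous data used here.

```r
# Kendall's tau as (concordant - discordant) / (number of pairs); no tie correction
kendall_by_pairs <- function(x, y) {
  n <- length(x)
  concordant <- 0
  discordant <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      s <- sign(x[i] - x[j]) * sign(y[i] - y[j])
      if (s > 0) concordant <- concordant + 1
      if (s < 0) discordant <- discordant + 1
    }
  }
  (concordant - discordant) / choose(n, 2)
}

# Simulated continuous data, so ties do not occur (illustrative only)
set.seed(4)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

tau_pairs   <- kendall_by_pairs(x, y)
tau_builtin <- cor(x, y, method = "kendall")
print(c(by_pairs = tau_pairs, builtin = tau_builtin))
```

With tied values the two numbers can differ, because the built-in estimate adjusts the denominator for ties.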

25 Characteristics of Kendall's τ
- It measures monotone association.
- It is invariant to monotone transformations (like a log transformation).
- It is robust to outliers.
- It has a more straightforward interpretation than Spearman's ρ.

26 Kendall's rank correlation coefficient: R code
    > cor.test(diabetes$bmi, diabetes$glucose, method="kendall")
        Kendall's rank correlation tau
    data:  diabetes$bmi and diabetes$glucose
    z = , p-value =
    alternative hypothesis: true tau is not equal to 0
    sample estimates:
    tau

27 Correlation in the diabetes data
r = (p = 0.110)
ρ = (p = 0.180)
τ = (p = 0.169)

28 Correlation in the diabetes data, with glucose log transformed
r = (p = 0.154)
ρ = (p = 0.180)
τ = (p = 0.169)

29 Limitations of correlation coefficients
- While it is (relatively) clear what a correlation coefficient of 0 means, and also 1 or -1, it is often unclear what a highly significant correlation of, say, 0.5 means.
- Correlation rarely answers the research question to a sufficient extent, because it is not easily interpretable.
- Correlation coefficients depend on the sample selection, and therefore we cannot compare values of the coefficients found in different data sets.
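The dependence on sample selection can be illustrated with a small simulation: the same underlying relationship gives a much smaller correlation coefficient when only a narrow range of X is sampled, while the regression slope stays roughly the same. Everything below (data, cut-off of 0.5, variable names) is an assumption chosen for illustration.

```r
# Simulated data with a fixed linear relationship: y = x + noise (illustrative only)
set.seed(5)
n <- 5000
x <- rnorm(n)
y <- x + rnorm(n)

# A "selected" sample that only covers a narrow range of x
keep <- abs(x) < 0.5

r_full       <- cor(x, y)
r_restricted <- cor(x[keep], y[keep])

slope_full       <- unname(coef(lm(y ~ x))[2])
slope_restricted <- unname(coef(lm(y[keep] ~ x[keep]))[2])

print(c(r_full = r_full, r_restricted = r_restricted))
print(c(slope_full = slope_full, slope_restricted = slope_restricted))
```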

30 Department of Biostatistics

31 Regression analysis
An (intuitively interpretable) way to describe a (linear) association between two continuous variables.
It models a response Y (the dependent variable, the endogenous variable, the output) as a function of a predictor X (the independent variable, the exogenous variable, the explanatory variable, the covariate) and a term representing other, random influences (error, noise).

32 Regression model formulation
We say: "to regress Y on X", or here: "to regress glucose on BMI".
Mathematically:
$$ Y_i = \alpha + \beta X_i + \varepsilon_i $$
where the $\varepsilon_i$ are independent, Normally distributed noise terms with mean 0 and standard deviation σ.

33 Regression model
The mean of Y is modelled as a linear function of X: a line in the X-Y plane.
For each X, Y is a random variable, Normally distributed around the modelled mean of Y with standard deviation σ.

34 Scatter-plot with regression line

35 Interpretation of the parameters
We have variation due to a systematic part, the explanatory variable, and a random part, the noise.
The systematic part of the model is defined by the regression line.
α = the intercept: the mean level of $Y_i$ when $X_i = 0$.
β = the slope: the mean increase in $Y_i$ when $X_i$ is increased by 1 unit.

36 Research question
Do fat people have more severe diabetes when the diabetes is discovered?
Or, in more statistical language: is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis?
In a (simple) linear regression analysis: is the slope β different from 0 (or, more pertinently, larger than 0)?

37 How does the model answer the research question?
Interest may focus on testing a simple hypothesis about each of the two parameters:
Null hypothesis: β = 0
Null hypothesis: α = 0
The second hypothesis often has no (clinical) meaning: here α is the mean glucose level at a BMI of 0, which no patient has.

38 Linear regression: R code
    > mymodel <- lm(diabetes$glucose~diabetes$bmi)
    > summary(mymodel)
    Call:
    lm(formula = diabetes$glucose ~ diabetes$bmi)
    Residuals:
    Min 1Q Median 3Q Max
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    diabetes$bmi
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    Residual standard error: on 723 degrees of freedom
      (4 observations deleted due to missingness)
    Multiple R-squared: , Adjusted R-squared:
    F-statistic: on 1 and 723 DF, p-value:
Annotations on the slide: the Coefficients block is the table with parameter estimates. In the diabetes$bmi row, the Estimate column is the estimate of the slope and the Pr(>|t|) column is the P-value of the test for the null hypothesis β = 0.
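The slope estimate and its P-value can also be pulled out of the summary programmatically instead of being read off the printed table. The sketch below uses simulated data and a `data.frame` named `dat` (both assumptions; the course data set is not reproduced here). It also illustrates that, in simple linear regression, the t-test for β = 0 gives the same t statistic and P-value as the test on Pearson's correlation coefficient from `cor.test()`.

```r
# Simulated stand-in data (NOT the course data)
set.seed(7)
n <- 150
dat <- data.frame(bmi = rnorm(n, mean = 30, sd = 5))
dat$glucose <- 14 - 0.05 * dat$bmi + rnorm(n, sd = 4)

# Fit the simple linear regression using the data= argument
fit <- lm(glucose ~ bmi, data = dat)

# Coefficient table as a numeric matrix; the "bmi" row holds the slope
coef_table <- coef(summary(fit))
slope_row  <- coef_table["bmi", ]
print(slope_row)

# Test on Pearson's r for the same two variables
ct <- cor.test(dat$bmi, dat$glucose)
print(c(t_regression = unname(slope_row["t value"]),
        t_correlation = unname(ct$statistic)))
```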

39 Plot of regression line: R code
The model fitted with lm() can be passed to abline() to draw the regression line in the scatter-plot:
    > plot(diabetes$bmi,diabetes$glucose)
    > mymodel <- lm(diabetes$glucose~diabetes$bmi)
    > abline(mymodel)

40 Scatter-plot with regression line: log transformed glucose
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    diabetes$bmi

41 How are the parameters estimated?
The estimated parameters of the linear model define the line (found among all possible lines) which minimizes the sum of the squared vertical distances between the data points and the line in the scatter-plot.
The estimation method is called ordinary least squares (maximum likelihood gives the same answer).
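For simple linear regression the least-squares solution has a closed form: the slope is the sample covariance of X and Y divided by the sample variance of X, and the intercept follows from the means. The sketch below compares this with `lm()` on simulated data; the parameter values and the helper `sse()` are illustrative assumptions.

```r
# Simulated data from the model Y = alpha + beta * X + noise (illustrative values)
set.seed(6)
n <- 500
x <- rnorm(n, mean = 30, sd = 5)
y <- 12 - 0.1 * x + rnorm(n, sd = 3)

# Closed-form ordinary least-squares estimates
beta_hat  <- cov(x, y) / var(x)
alpha_hat <- mean(y) - beta_hat * mean(x)

# The same estimates from lm()
fit <- lm(y ~ x)
print(rbind(closed_form = c(alpha_hat, beta_hat), lm = unname(coef(fit))))

# Hypothetical helper: sum of squared vertical distances to the line a + b * x
sse <- function(a, b) sum((y - (a + b * x))^2)

# Nudging the fitted line in any direction makes the sum of squares larger
print(c(fitted = sse(alpha_hat, beta_hat),
        shifted = sse(alpha_hat + 0.1, beta_hat),
        tilted = sse(alpha_hat, beta_hat + 0.01)))
```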

42 Least squares fit

43 Does the model fit the data?

44 Diagnostic plots

45 Diagnostic plots
R produces some diagnostic plots (of varying usefulness).
The residuals (the error or noise) were assumed to be Normally distributed; this can be studied in the Q-Q plot (top right).
More importantly, the residuals should have a single standard deviation, i.e. the variance should not increase with, for example, BMI. This can be studied in the residuals vs. fitted plot (top left).
    > mymodel <- lm(diabetes$glucose~diabetes$bmi)
    > opar <- par(mfrow = c(2,2), oma = c(0,0,1.1,0))
    > plot(mymodel)
    > par(opar)

46 Data transformations
If the residuals are not Normal, or if the variance seems to increase with the level, it may be a good idea to transform one or both variables. Increasing variance is the more serious problem, because the central limit theorem deals with much of the non-Normality issue.
This is the real reason to investigate log(glucose) instead of glucose.

47 Data transformations: log transform

48 The influence of one outlier

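49 Simpson's paradox
Florida death penalty verdicts for homicide, by defendant's race:
Defendant   Death penalty
White       11% (53/430)
Black        8% (15/176)
The counts in parentheses read as death penalty verdicts / other verdicts, which is what reproduces the percentages: 53/(53+430) = 11% and 15/(15+176) = 8%.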

50 Simpson's paradox
                Defendant white   Defendant black
Victim white    11% (53/414)      23% (11/37)
Victim black     0% (0/16)         3% (4/139)
Blacks tend to murder blacks and whites tend to murder whites, and the murder of a white person has a higher probability of a death penalty.
For either victim's race, the probability of a death penalty is higher for a black defendant: about 2 times higher when the victim is white (23% vs. 11%), and 3% vs. 0% when the victim is black.
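The reversal can be reproduced from the counts on the two slides. The sketch below assumes that each pair of counts is (death penalty, no death penalty), a reading inferred from the fact that it reproduces the printed percentages; the array layout and object names are choices made for this illustration.

```r
# Counts from the slides, read as (death penalty yes, death penalty no): an assumption
counts <- array(
  c(53, 11, 414, 37,   # victim white: yes (White, Black defendant), no (White, Black defendant)
     0,  4,  16, 139), # victim black: yes (White, Black defendant), no (White, Black defendant)
  dim = c(2, 2, 2),
  dimnames = list(defendant = c("White", "Black"),
                  verdict   = c("yes", "no"),
                  victim    = c("White", "Black"))
)

# Death penalty proportion per defendant's race, separately for each victim's race
by_victim <- counts[, "yes", ] / (counts[, "yes", ] + counts[, "no", ])

# Collapse over victim's race, then compute the marginal proportions
pooled   <- apply(counts, c(1, 2), sum)
marginal <- pooled[, "yes"] / rowSums(pooled)

print(round(100 * by_victim))  # stratified percentages
print(round(100 * marginal))   # pooled percentages
```

Pooled over victims, white defendants have the higher proportion; within each victim stratum, black defendants do. The strata differ greatly in size and in baseline death penalty rate, which is what makes the pooled comparison misleading.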

51 Confounding
[Diagram with nodes Victim's race, Defendant's race and Death penalty; the association between Defendant's race and Death penalty is highlighted in green.]
We are interested in the green highlighted association, but the victim's race is correlated with both the defendant's race and the outcome of the trial.

52 Confounding
[Diagram with nodes Confounder, Exposure and Outcome; the association between Exposure and Outcome is highlighted in green.]
A confounder influences both exposure and outcome.
When confounding is present we cannot interpret the green highlighted association as causal.

53 Randomization
[Diagram with nodes Exposure (randomised), Confounder and Outcome.]
Often there are many factors that may influence both exposure and outcome; some of them may not be observed or are unknown.
If exposure is randomised, then there is no confounding. The green highlighted association can then be interpreted as causal.

54 Two regressions
The blue points denote patients with systolic blood pressure (SBP) > 140 mmHg; the blue line is the corresponding regression line.
The red points denote patients with SBP < 140 mmHg; the red line is the corresponding regression line.
The black line is the general regression line.
The slopes from the stratified analyses are less steep than the slope of the general line.

55 Multiple regression
    > mymodel <- lm(log(diabetes$glucose)~diabetes$bmi+diabetes$sbp)
    > summary(mymodel)
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)                               <2e-16 ***
    diabetes$bmi
    diabetes$sbp                                     *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The adjusted slope (association) of bmi is less pronounced than before.
SBP is related to both glucose and bmi and is a confounder.

56 Multiple regression
Adjusting a statistical analysis means including other predictor variables in the model formula.
Intuitively, a slope for BMI is determined for each level of the SBP variable separately and these are then averaged.
Including SBP in the analysis removes the confounding effect of SBP from the relationship between log(glucose) and BMI.
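A small simulation can make the effect of adjustment concrete. In the sketch below SBP influences both BMI and log(glucose), so the unadjusted (crude) slope for BMI absorbs part of the SBP effect, while the adjusted slope recovers the value used to generate the data. All numbers and variable names are assumptions chosen for illustration; they are not estimates from the course data.

```r
# Simulated data in which SBP confounds the BMI-glucose relationship (illustrative values)
set.seed(9)
n <- 20000
sbp <- rnorm(n, mean = 140, sd = 20)
bmi <- 20 + 0.06 * sbp + rnorm(n, sd = 4)                         # SBP influences BMI
log_glucose <- 1 + 0.01 * bmi + 0.01 * sbp + rnorm(n, sd = 0.3)   # true BMI slope: 0.01

# Crude slope: BMI only
crude <- unname(coef(lm(log_glucose ~ bmi))["bmi"])

# Adjusted slope: SBP added to the model formula
adjusted <- unname(coef(lm(log_glucose ~ bmi + sbp))["bmi"])

print(c(crude = crude, adjusted = adjusted, true = 0.01))
```

As on the slides, the adjusted slope is less pronounced than the crude one, because the crude analysis attributes part of the SBP effect to BMI.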

57 Take home message
Association between two continuous variables may be measured by correlation coefficients or in a (simple) linear regression analysis.
The latter arguably provides the most interpretable results. Moreover, it is straightforwardly extended to deal with confounding, and more.


Perform hypothesis testing Multivariate hypothesis tests for fixed effects Testing homogeneity of level-1 variances In the following sections, we use the model displayed in the figure below to illustrate the hypothesis tests. Partial

More information

Example: Boats and Manatees

Example: Boats and Manatees Figure 9-6 Example: Boats and Manatees Slide 1 Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant

More information

Section 3 Part 1. Relationships between two numerical variables

Section 3 Part 1. Relationships between two numerical variables Section 3 Part 1 Relationships between two numerical variables 1 Relationship between two variables The summary statistics covered in the previous lessons are appropriate for describing a single variable.

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Exchange Rate Regime Analysis for the Chinese Yuan

Exchange Rate Regime Analysis for the Chinese Yuan Exchange Rate Regime Analysis for the Chinese Yuan Achim Zeileis Ajay Shah Ila Patnaik Abstract We investigate the Chinese exchange rate regime after China gave up on a fixed exchange rate to the US dollar

More information

Logistic Regression (a type of Generalized Linear Model)

Logistic Regression (a type of Generalized Linear Model) Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge

More information

Paired Differences and Regression

Paired Differences and Regression Paired Differences and Regression Students sometimes have difficulty distinguishing between paired data and independent samples when comparing two means. One can return to this topic after covering simple

More information

PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU

PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU PITFALLS IN TIME SERIES ANALYSIS Cliff Hurvich Stern School, NYU The t -Test If x 1,..., x n are independent and identically distributed with mean 0, and n is not too small, then t = x 0 s n has a standard

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova

More information

Handling missing data in Stata a whirlwind tour

Handling missing data in Stata a whirlwind tour Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled

More information

E205 Final: Version B

E205 Final: Version B Name: Class: Date: E205 Final: Version B Multiple Choice Identify the choice that best completes the statement or answers the question. 1. The owner of a local nightclub has recently surveyed a random

More information

Regression III: Advanced Methods

Regression III: Advanced Methods Lecture 5: Linear least-squares Regression III: Advanced Methods William G. Jacoby Department of Political Science Michigan State University http://polisci.msu.edu/jacoby/icpsr/regress3 Simple Linear Regression

More information

Regression in SPSS. Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology

Regression in SPSS. Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology Regression in SPSS Workshop offered by the Mississippi Center for Supercomputing Research and the UM Office of Information Technology John P. Bentley Department of Pharmacy Administration University of

More information

e = random error, assumed to be normally distributed with mean 0 and standard deviation σ

e = random error, assumed to be normally distributed with mean 0 and standard deviation σ 1 Linear Regression 1.1 Simple Linear Regression Model The linear regression model is applied if we want to model a numeric response variable and its dependency on at least one numeric factor variable.

More information

What is correlational research?

What is correlational research? Key Ideas Purpose and use of correlational designs How correlational research developed Types of correlational designs Key characteristics of correlational designs Procedures used in correlational studies

More information

Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

More information

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

More information

Simple Linear Regression

Simple Linear Regression Chapter Nine Simple Linear Regression Consider the following three scenarios: 1. The CEO of the local Tourism Authority would like to know whether a family s annual expenditure on recreation is related

More information

Epidemiology-Biostatistics Exam Exam 2, 2001 PRINT YOUR LEGAL NAME:

Epidemiology-Biostatistics Exam Exam 2, 2001 PRINT YOUR LEGAL NAME: Epidemiology-Biostatistics Exam Exam 2, 2001 PRINT YOUR LEGAL NAME: Instructions: This exam is 30% of your course grade. The maximum number of points for the course is 1,000; hence this exam is worth 300

More information

SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Stat 411/511 ANOVA & REGRESSION. Charlotte Wickham. stat511.cwick.co.nz. Nov 31st 2015

Stat 411/511 ANOVA & REGRESSION. Charlotte Wickham. stat511.cwick.co.nz. Nov 31st 2015 Stat 411/511 ANOVA & REGRESSION Nov 31st 2015 Charlotte Wickham stat511.cwick.co.nz This week Today: Lack of fit F-test Weds: Review email me topics, otherwise I ll go over some of last year s final exam

More information

Psychology 205: Research Methods in Psychology

Psychology 205: Research Methods in Psychology Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Regression Analysis: A Complete Example

Regression Analysis: A Complete Example Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

More information

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this

More information

Using SPSS for Multiple Regression. UDP 520 Lab 7 Lin Lin December 4 th, 2007

Using SPSS for Multiple Regression. UDP 520 Lab 7 Lin Lin December 4 th, 2007 Using SPSS for Multiple Regression UDP 520 Lab 7 Lin Lin December 4 th, 2007 Step 1 Define Research Question What factors are associated with BMI? Predict BMI. Step 2 Conceptualizing Problem (Theory) Individual

More information

The importance of graphing the data: Anscombe s regression examples

The importance of graphing the data: Anscombe s regression examples The importance of graphing the data: Anscombe s regression examples Bruce Weaver Northern Health Research Conference Nipissing University, North Bay May 30-31, 2008 B. Weaver, NHRC 2008 1 The Objective

More information

CORRELATION AND SIMPLE REGRESSION ANALYSIS USING SAS IN DAIRY SCIENCE

CORRELATION AND SIMPLE REGRESSION ANALYSIS USING SAS IN DAIRY SCIENCE CORRELATION AND SIMPLE REGRESSION ANALYSIS USING SAS IN DAIRY SCIENCE A. K. Gupta, Vipul Sharma and M. Manoj NDRI, Karnal-132001 When analyzing farm records, simple descriptive statistics can reveal a

More information

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

More information

MTH 140 Statistics Videos

MTH 140 Statistics Videos MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative

More information

Principles of Hypothesis Testing for Public Health

Principles of Hypothesis Testing for Public Health Principles of Hypothesis Testing for Public Health Laura Lee Johnson, Ph.D. Statistician National Center for Complementary and Alternative Medicine johnslau@mail.nih.gov Fall 2011 Answers to Questions

More information

The scatterplot indicates a positive linear relationship between waist size and body fat percentage:

The scatterplot indicates a positive linear relationship between waist size and body fat percentage: STAT E-150 Statistical Methods Multiple Regression Three percent of a man's body is essential fat, which is necessary for a healthy body. However, too much body fat can be dangerous. For men between the

More information

Statistics for Sports Medicine

Statistics for Sports Medicine Statistics for Sports Medicine Suzanne Hecht, MD University of Minnesota (suzanne.hecht@gmail.com) Fellow s Research Conference July 2012: Philadelphia GOALS Try not to bore you to death!! Try to teach

More information

MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if

More information

Pearson s correlation

Pearson s correlation Pearson s correlation Introduction Often several quantitative variables are measured on each member of a sample. If we consider a pair of such variables, it is frequently of interest to establish if there

More information

STAT 350 Practice Final Exam Solution (Spring 2015)

STAT 350 Practice Final Exam Solution (Spring 2015) PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

More information

Simple Linear Regression Chapter 11

Simple Linear Regression Chapter 11 Simple Linear Regression Chapter 11 Rationale Frequently decision-making situations require modeling of relationships among business variables. For instance, the amount of sale of a product may be related

More information

Simple Linear Regression

Simple Linear Regression Inference for Regression Simple Linear Regression IPS Chapter 10.1 2009 W.H. Freeman and Company Objectives (IPS Chapter 10.1) Simple linear regression Statistical model for linear regression Estimating

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

How to choose a statistical test. Francisco J. Candido dos Reis DGO-FMRP University of São Paulo

How to choose a statistical test. Francisco J. Candido dos Reis DGO-FMRP University of São Paulo How to choose a statistical test Francisco J. Candido dos Reis DGO-FMRP University of São Paulo Choosing the right test One of the most common queries in stats support is Which analysis should I use There

More information

Sydney Roberts Predicting Age Group Swimmers 50 Freestyle Time 1. 1. Introduction p. 2. 2. Statistical Methods Used p. 5. 3. 10 and under Males p.

Sydney Roberts Predicting Age Group Swimmers 50 Freestyle Time 1. 1. Introduction p. 2. 2. Statistical Methods Used p. 5. 3. 10 and under Males p. Sydney Roberts Predicting Age Group Swimmers 50 Freestyle Time 1 Table of Contents 1. Introduction p. 2 2. Statistical Methods Used p. 5 3. 10 and under Males p. 8 4. 11 and up Males p. 10 5. 10 and under

More information

Lets suppose we rolled a six-sided die 150 times and recorded the number of times each outcome (1-6) occured. The data is

Lets suppose we rolled a six-sided die 150 times and recorded the number of times each outcome (1-6) occured. The data is In this lab we will look at how R can eliminate most of the annoying calculations involved in (a) using Chi-Squared tests to check for homogeneity in two-way tables of catagorical data and (b) computing

More information

Using Excel for inferential statistics

Using Excel for inferential statistics FACT SHEET Using Excel for inferential statistics Introduction When you collect data, you expect a certain amount of variation, just caused by chance. A wide variety of statistical tests can be applied

More information