Basic Statistics and Data Analysis for Health Researchers from Foreign Countries
|
|
- Emery Heath
- 4 years ago
- Views:
Transcription
1 Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma The Research Unit for General Practice in Copenhagen Dias 1
2 Content Quantifying association between continuous variables. In particular: Correlation (Simple) regression Dias 2
3 Example Newly diagnosed Type 2 Diabetes pt glucose bmi sex age A data set with 729 newly diagnosed Type 2 diabetes patients. pt: Patient ID glucose: Diagnostic plasma glucose (mmol/l) bmi: sex: age: Body Mass Index (kg/m2) sex (1=male, 0=female) age (years) Dias 3
4 Research question Do fat people have a more severe diabetes when the diabetes is discovered? Or in a more statistical language: Is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis? Dias 4
5 Scatter-plot When investigating a potential association between only two variables (like diagnostic plasma glucose and BMI) a scatterplot is an important part of the analysis. It gives insight in the nature of the association. It shows problems in the data, e.g. outliers, strange or impossible values. Dias 5
6 Scatter-plot Dias 6
7 Scatter-plot There is no apparent tendency, specifically not one that would support our research question and if we have to point out a tendency, it would be that high BMI associates with lower diagnostic glucose (why is this not so strange if we think about the diagnosis of diabetes?). There seem to be some very large values, especially for diagnostic plasma glucose. These are valid measurements. Maybe a log transformation of glucose would make associations more apparent? Dias 7
8 Scatter-plot R code plot(diabetes$bmi,diabetes$glucose, frame=true, main=null, xlab= BMI (kg/m2), ylab= Glucose (mmol/l), col= green, pch=19) Dias 8
9 Scatter-plot log transformation Dias 9
10 Measures of association We want to capture the association between two variables in a single number: a correlation coefficient, a measure of association. Suppose that Y i is the diagnostic plasma glucose of patient i and X i the BMI for the same person. Then we want our measure of association to have the following characteristics: A positive association indicates that if X i is large (relative to the rest of the sample) then Y i is likely to be large as well. A negative association indicates that if X i is large then Y i is likely to be small. Dias 10
11 Measures of association between -1 and 1 0 : No association 1 : perfect positive association -1 : Perfect negative association Dias 11
12 Measures of association for the diabetes data r = ρ = τ = Dias 12
13 Measures of association for the diabetes data and log transformed r = ρ = τ = Only the first one changes! Dias 13
14 Pearson s correlation coefficient Pearson s correlation coefficient is computed from the data set (X i, Y i ), i = 1,,N as: X Y r = N i= 1 ( X where and are the respective means and SD x and SD y the respective standard deviations. i X )( Y ( N 1) SD SD x i Y ) y Dias 14
15 Characteristics of Pearson s correlation coefficient Pearson s correlation coefficient has the following properties: It measures the degree of linear association. It is invariant to linear change of scale for the variables. It is not robust to outliers. Coefficient values that are comparable between different data sets, and moreover a valid confidence interval and p-value, require that both X i and Y i are normally distributed. Dias 15
16 Pearson s correlation coefficient R code > cor(diabetes$bmi,diabetes$glucose,use= complete.obs ) [1] Gives only the correlation coefficient. > cor.test(diabetes$bmi,diabetes$glucose) Pearson's product-moment correlation data: diabetes$bmi and diabetes$glucose t = , df = 723, p-value = alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: sample estimates: cor Also performs a statistical test to see whether the coefficient is different from zero. Dias 16
17 Normally distributed? BMI Glucose A Normal distribution for comparison. Dias 17
18 Normally distributed? BMI Log(Glucose) Dias 18
19 Normally distributed? Dias 19
20 Normally distributed? Dias 20
21 R code A histogram of BMI: hist(diabetes$bmi,main= BMI,xlab= BMI (kg/m2),col= green ) A Normal Q-Q plot of BMI: qqnorm(diabetes$bmi,main= BMI,col= green ) qqline(diabetes$bmi,col= red ) And how do we get all these works of art in some decent format? jpeg(file= D:\mydirectory\mypicture.jpg,width=500,height=500) # # put here the code that generates the picture # dev.off() Dias 21
22 Rank correlation Spearman s ρ If data does not appear to be Normally distributed, or when there are outliers, one may instead compute the correlation between the ranks of the X i values and the ranks of the Y i values. This gives a nonparametric correlation coefficient called Spearman s ρ. It measures monotone association. It is invariant to monotone transformations (like a log transformation). It is robust to outliers. It has an odd interpretation. Dias 22
23 Spearman s rank correlation coefficient R code > cor.test(diabetes$bmi,diabetes$glucose,method= spearman ) Spearman's rank correlation rho data: diabetes$bmi and diabetes$glucose S = , p-value = alternative hypothesis: true rho is not equal to 0 sample estimates: rho Warning message: In cor.test.default(diabetes$bmi, diabetes$glucose, method = "spearman") : Cannot compute exact p-values with ties Dias 23
24 Rank correlation Kendall s τ A measure of monotone association with a more intuitive interpretation than Spearman s ρ is Kendall s τ. The observations from a pair of subjects i, j are and concordant if X i < X j and Y i < Y j or X i > X j and Y i > Y j discordant if X i < X j and Y i > Y j or X i > X j and Y i < Y j Kendall s τ is the difference between the probability for a concordant pair and the probability for a discordant pair. There are various versions of Kendall s τ depending on how ties are treated. Dias 24
25 Characteristics of Kendall s tau It measures monotone association. It is invariant to monotone transformations (like a log transformation). It is robust to outliers. It has a more straightforward interpretation than Spearman s rho. Dias 25
26 Kendall s rank correlation coefficient R code > cor.test(diabetes$bmi,diabetes$glucose,method= kendall ) Kendall's rank correlation tau data: diabetes$bmi and diabetes$glucose z = , p-value = alternative hypothesis: true tau is not equal to 0 sample estimates: tau Dias 26
27 Correlation in the diabetes data r = (p = 0.110) ρ = (p = 0.180) τ = (p = 0.169) Dias 27
28 Correlation in the diabetes data and log transformed r = (p = 0.154) ρ = (p = 0.180) τ = (p = 0.169) Dias 28
29 Limitations of correlation coefficients While it is (relatively) clear what a correlation coefficient of 0 means, and also 1 or -1, it is often unclear what a highly significant correlation of, say, 0.5 means Correlation rarely answers the research question to a sufficient extend; because it is not easily interpretable. Coefficients of correlation depend on the sample selection and therefore we cannot compare values of the coefficients found in different data. Dias 29
30 Dias 30 Department of Biostatistics
31 Regression analysis An (intuitively interpretable) way to describe a (linear) association between two continuous type variables. It models a response Y (the dependent variable, the exogenous variable, the output) as a function of a predictor X (the independent variable, the exogenous variable, the explanatory variable, the covariate) and a term representing random other influences (error, noise). Dias 31
32 Regression model formulation We say: To regress Y on X or: To regress glucose on BMI Mathematically: Y i = α + βx i + ε i Where ε i are independently Normal distributed noise terms with mean 0 and standard deviation σ. Dias 32
33 Regression model The mean of Y is modelled with a linear function of X; a line in the X-Y plane. For each X, Y is a random variable Normally distributed around the modelled mean of Y, with standard deviation σ Dias 33
34 Scatter-plot with regression line Dias 34
35 Interpretation of the parameters We have variation due to a systematic part, the explanatory variable, and a random part, the noise. The systematic part of the model is defined by the regression line. α = the intercept: mean level for Y i when X i = 0 β = the slope: mean increase for Y i when X i is increased 1 unit. Dias 35
36 Research question Do fat people have a more severe diabetes when the diabetes is discovered? Or in a more statistical language: Is diagnostic plasma glucose (positively) associated with the body mass index at the time of diagnosis? In a (simple) linear regression analysis, is the slope β different from 0 (or more pertinently, larger than 0)? Dias 36
37 How does the model answer the research question? Interest may focus on making a simple hypothesis about the two parameters: Null hypothesis : β = 0 Null hypothesis : α = 0 The second hypothesis often has no (clinical) meaning. Dias 37
38 Linear regression R code > mymodel <- lm(diabetes$glucose~diabetes$bmi) > summary(mymodel) Call: lm(formula = diabetes$glucose ~ diabetes$bmi) Residuals: Min 1Q Median 3Q Max Estimate of the slope P-value of the test for the null hypothesis β = 0. Coefficients: Estimate Std. Error t value Pr(> t ) Table with (Intercept) <2e-16 *** parameter diabetes$bmi estimates --- Signif. codes: 0 *** ** 0.01 * Residual standard error: on 723 degrees of freedom (4 observations deleted due to missingness) Multiple R-squared: , Adjusted R-squared: F-statistic: on 1 and 723 DF, p-value: Dias 38
39 Plot of regression line R code The lm() function can be used to plot the regression line in the scatter-plot: > plot(diabetes$bmi,diabetes$glucose) > mymodel <- lm(diabetes$glucose~diabetes$bmi) > abline(mymodel) Dias 39
40 Scatter-plot with regression line log transformed glucose Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** diabetes$bmi Dias 40
41 How are the parameters estimated? The estimated parameters of the linear model define the line (found among all possible lines) which minimizes the squared distance between the data-points and the line in the scatter-plot. The estimation method is called ordinary least-squares (maximum likelihood gives the same answer). Dias 41
42 Least squares fit Dias 42
43 Does the model fit the data? Dias 43
44 Diagnostic plots Dias 44
45 Diagnostic plots R produces some diagnostic plots (of varying usefulness). The residuals (the error or noise) was supposed to be Normal distributed, this can be studied in the Q-Q plot (top right) More importantly, the residuals should have a single standard deviation, i.e. the variance should not increase with, for example, BMI. This can be studied in the residuals vs. fitted plot (top left) > mymodel <- lm(diabetes$glucose~diabetes$bmi) > opar <- par(mfrow = c(2,2), oma = c(0,0,1.1,0)) > plot(mymodel) > par(opar) Dias 45
46 Data transformations If the residuals are not Normal, or (and this is more serious because the central limit theorem deals with much of the non- Normality issue) if variance seems to increase with level, it may be a good idea to transform one or both variables. This is the real reason to investigate log(glucose) instead of glucose. Dias 46
47 Data transformations log transform Dias 47
48 The influence of one outlier Dias 48
49 Simpson s paradox Florida death penalty verdicts for homicide relative to defendant s race White Black 11% (53/430) 8% (15/176) Dias 49
50 Simpson s paradox Victim white Victim black White Black 11% 23% (53/414) (11/37) 0% 3% (0/16) (4/139) Blacks tend to murder blacks and whites tend to murder whites and the murder of a white person has a higher probability of death penalty. For any victim the probability for a black person to get death penalty is about 2 times higher. Dias 50
51 Confounding Victim s race We are interested in the green highlighted association, but there is a correlation with the victim s race both with the defendant s race and the outcome of the trial. Defendant s race Death penalty Dias 51
52 Confounding A confounder influences both exposure and outcome Confounder When confounding is present we cannot interpret the green highlighted association as causal Exposure Outcome Dias 52
53 Randomization Exposure randomised Confounder Outcome Often there are many factors that may influence both exposure and outcome, some of them may not be observed or are unknown. If exposure is randomised, then there is no confounding. The green highlighted association can be interpreted causal. Dias 53
54 Two regressions The blue points denote patients with SBP>140 mmhg; the blue line the corresponding regression line. The red points denote patients with SBP < 140 mmhg; the red line the corresponding regression line. The black line is the general regression line. The slopes from the stratified analyses are less steep than the slope of the general line. Dias 54
55 Multiple regression > mymodel <- lm(log(diabetes$glucose)~diabetes$bmi+diabetes$sbp) > summary(mymodel) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) <2e-16 *** diabetes$bmi diabetes$sbp * --- Signif. codes: 0 *** ** 0.01 * The adjusted slope (association) of bmi is less pronounced than before. SBP is related to both glucose and bmi and is a confounder. Dias 55
56 Multiple regression Adjusting a statistical analysis means to include other predictor variables into the model formula. Intuitively, a slope for BMI is determined for each level of the SBP variable separately and these are then averaged. including SBP in the analysis removes the confounding effect of SBP from the relationship between log(glucose) and BMI. Dias 56
57 Take home message Association between two continuous variables may be measured by correlation coefficients or in (simple) linear regression analysis. The latter provides arguably the best interpretable results. Moreover, it is straightforwardly extended to be able to deal with confounding, and more Dias 57
5. Linear Regression
5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4
Multiple Linear Regression
Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate
EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION
EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION EDUCATION AND VOCABULARY 5-10 hours of input weekly is enough to pick up a new language (Schiff & Myers, 1988). Dutch children spend 5.5 hours/day
Correlation and Simple Linear Regression
Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a
Week 5: Multiple Linear Regression
BUS41100 Applied Regression Analysis Week 5: Multiple Linear Regression Parameter estimation and inference, forecasting, diagnostics, dummy variables Robert B. Gramacy The University of Chicago Booth School
11. Analysis of Case-control Studies Logistic Regression
Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
Part 2: Analysis of Relationship Between Two Variables
Part 2: Analysis of Relationship Between Two Variables Linear Regression Linear correlation Significance Tests Multiple regression Linear Regression Y = a X + b Dependent Variable Independent Variable
1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96
1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years
Chapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -
We extended the additive model in two variables to the interaction model by adding a third term to the equation.
Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic
Using R for Linear Regression
Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional
Section 14 Simple Linear Regression: Introduction to Least Squares Regression
Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship
Multivariate Logistic Regression
1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation
Chapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
Introduction to Regression and Data Analysis
Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
Testing for Lack of Fit
Chapter 6 Testing for Lack of Fit How can we tell if a model fits the data? If the model is correct then ˆσ 2 should be an unbiased estimate of σ 2. If we have a model which is not complex enough to fit
Simple linear regression
Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between
17. SIMPLE LINEAR REGRESSION II
17. SIMPLE LINEAR REGRESSION II The Model In linear regression analysis, we assume that the relationship between X and Y is linear. This does not mean, however, that Y can be perfectly predicted from X.
Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares
Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
Comparing Nested Models
Comparing Nested Models ST 430/514 Two models are nested if one model contains all the terms of the other, and at least one additional term. The larger model is the complete (or full) model, and the smaller
Final Exam Practice Problem Answers
Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal
Simple Linear Regression Inference
Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation
E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F
Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,
1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ
STA 3024 Practice Problems Exam 2 NOTE: These are just Practice Problems. This is NOT meant to look just like the test, and it is NOT the only thing that you should study. Make sure you know all the material
Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm
Mgt 540 Research Methods Data Analysis 1 Additional sources Compilation of sources: http://lrs.ed.uiuc.edu/tseportal/datacollectionmethodologies/jin-tselink/tselink.htm http://web.utk.edu/~dap/random/order/start.htm
Univariate Regression
Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression
Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression Objectives: To perform a hypothesis test concerning the slope of a least squares line To recognize that testing for a
DATA INTERPRETATION AND STATISTICS
PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE
Interpretation of Somers D under four simple models
Interpretation of Somers D under four simple models Roger B. Newson 03 September, 04 Introduction Somers D is an ordinal measure of association introduced by Somers (96)[9]. It can be defined in terms
X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)
CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.
Logistic Regression (a type of Generalized Linear Model)
Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
Stat 412/512 CASE INFLUENCE STATISTICS. Charlotte Wickham. stat512.cwick.co.nz. Feb 2 2015
Stat 412/512 CASE INFLUENCE STATISTICS Feb 2 2015 Charlotte Wickham stat512.cwick.co.nz Regression in your field See website. You may complete this assignment in pairs. Find a journal article in your field
Econometrics Simple Linear Regression
Econometrics Simple Linear Regression Burcu Eke UC3M Linear equations with one variable Recall what a linear equation is: y = b 0 + b 1 x is a linear equation with one variable, or equivalently, a straight
Descriptive Statistics
Descriptive Statistics Primer Descriptive statistics Central tendency Variation Relative position Relationships Calculating descriptive statistics Descriptive Statistics Purpose to describe or summarize
Deterministic and Stochastic Modeling of Insulin Sensitivity
Deterministic and Stochastic Modeling of Insulin Sensitivity Master s Thesis in Engineering Mathematics and Computational Science ELÍN ÖSP VILHJÁLMSDÓTTIR Department of Mathematical Science Chalmers University
Categorical Data Analysis
Richard L. Scheaffer University of Florida The reference material and many examples for this section are based on Chapter 8, Analyzing Association Between Categorical Variables, from Statistical Methods
Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011
Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this
Section 3 Part 1. Relationships between two numerical variables
Section 3 Part 1 Relationships between two numerical variables 1 Relationship between two variables The summary statistics covered in the previous lessons are appropriate for describing a single variable.
Exercise 1.12 (Pg. 22-23)
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management
KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To
Example: Boats and Manatees
Figure 9-6 Example: Boats and Manatees Slide 1 Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant
MTH 140 Statistics Videos
MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative
Part II. Multiple Linear Regression
Part II Multiple Linear Regression 86 Chapter 7 Multiple Regression A multiple linear regression model is a linear model that describes how a y-variable relates to two or more xvariables (or transformations
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova
Section 1: Simple Linear Regression
Section 1: Simple Linear Regression Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction
PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU
PITFALLS IN TIME SERIES ANALYSIS Cliff Hurvich Stern School, NYU The t -Test If x 1,..., x n are independent and identically distributed with mean 0, and n is not too small, then t = x 0 s n has a standard
CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression
Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the
Regression Analysis: A Complete Example
Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty
MULTIPLE REGRESSION EXAMPLE
MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if
5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
STAT 350 Practice Final Exam Solution (Spring 2015)
PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects
Exchange Rate Regime Analysis for the Chinese Yuan
Exchange Rate Regime Analysis for the Chinese Yuan Achim Zeileis Ajay Shah Ila Patnaik Abstract We investigate the Chinese exchange rate regime after China gave up on a fixed exchange rate to the US dollar
The importance of graphing the data: Anscombe s regression examples
The importance of graphing the data: Anscombe s regression examples Bruce Weaver Northern Health Research Conference Nipissing University, North Bay May 30-31, 2008 B. Weaver, NHRC 2008 1 The Objective
Using Excel for inferential statistics
FACT SHEET Using Excel for inferential statistics Introduction When you collect data, you expect a certain amount of variation, just caused by chance. A wide variety of statistical tests can be applied
SPSS Guide: Regression Analysis
SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar
Lets suppose we rolled a six-sided die 150 times and recorded the number of times each outcome (1-6) occured. The data is
In this lab we will look at how R can eliminate most of the annoying calculations involved in (a) using Chi-Squared tests to check for homogeneity in two-way tables of catagorical data and (b) computing
Psychology 205: Research Methods in Psychology
Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready
Lecture 11: Confidence intervals and model comparison for linear regression; analysis of variance
Lecture 11: Confidence intervals and model comparison for linear regression; analysis of variance 14 November 2007 1 Confidence intervals and hypothesis testing for linear regression Just as there was
1 Simple Linear Regression I Least Squares Estimation
Simple Linear Regression I Least Squares Estimation Textbook Sections: 8. 8.3 Previously, we have worked with a random variable x that comes from a population that is normally distributed with mean µ and
Simple Linear Regression
Chapter Nine Simple Linear Regression Consider the following three scenarios: 1. The CEO of the local Tourism Authority would like to know whether a family s annual expenditure on recreation is related
Simple Predictive Analytics Curtis Seare
Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics
Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in
INTRODUCTORY STATISTICS
INTRODUCTORY STATISTICS FIFTH EDITION Thomas H. Wonnacott University of Western Ontario Ronald J. Wonnacott University of Western Ontario WILEY JOHN WILEY & SONS New York Chichester Brisbane Toronto Singapore
ECON 142 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE #2
University of California, Berkeley Prof. Ken Chay Department of Economics Fall Semester, 005 ECON 14 SKETCH OF SOLUTIONS FOR APPLIED EXERCISE # Question 1: a. Below are the scatter plots of hourly wages
MIXED MODEL ANALYSIS USING R
Research Methods Group MIXED MODEL ANALYSIS USING R Using Case Study 4 from the BIOMETRICS & RESEARCH METHODS TEACHING RESOURCE BY Stephen Mbunzi & Sonal Nagda www.ilri.org/rmg www.worldagroforestrycentre.org/rmg
Lucky vs. Unlucky Teams in Sports
Lucky vs. Unlucky Teams in Sports Introduction Assuming gambling odds give true probabilities, one can classify a team as having been lucky or unlucky so far. Do results of matches between lucky and unlucky
Generalized Linear Models
Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the
Stat 5303 (Oehlert): Tukey One Degree of Freedom 1
Stat 5303 (Oehlert): Tukey One Degree of Freedom 1 > catch
The Dummy s Guide to Data Analysis Using SPSS
The Dummy s Guide to Data Analysis Using SPSS Mathematics 57 Scripps College Amy Gamble April, 2001 Amy Gamble 4/30/01 All Rights Rerserved TABLE OF CONTENTS PAGE Helpful Hints for All Tests...1 Tests
A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study
A Review of Cross Sectional Regression for Financial Data You should already know this material from previous study But I will offer a review, with a focus on issues which arise in finance 1 TYPES OF FINANCIAL
Ordinal Regression. Chapter
Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe
Correlation and Regression
Correlation and Regression Scatterplots Correlation Explanatory and response variables Simple linear regression General Principles of Data Analysis First plot the data, then add numerical summaries Look
Point Biserial Correlation Tests
Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the product-moment correlation calculated between a continuous random variable
Regression step-by-step using Microsoft Excel
Step 1: Regression step-by-step using Microsoft Excel Notes prepared by Pamela Peterson Drake, James Madison University Type the data into the spreadsheet The example used throughout this How to is a regression
Handling missing data in Stata a whirlwind tour
Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled
Basic Statistical and Modeling Procedures Using SAS
Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom
We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries?
Statistics: Correlation Richard Buxton. 2008. 1 Introduction We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries? Do
A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn
A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps
Organizing Your Approach to a Data Analysis
Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize
2. Regression and Correlation. Simple Linear Regression Software: R
2. Regression and Correlation Simple Linear Regression Software: R Create txt file from SAS data set data _null_; file 'C:\Documents and Settings\sphlab\Desktop\slr1.txt'; set temp; put input day:date7.
HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1. used confidence intervals to answer questions such as...
HYPOTHESIS TESTING (ONE SAMPLE) - CHAPTER 7 1 PREVIOUSLY used confidence intervals to answer questions such as... You know that 0.25% of women have red/green color blindness. You conduct a study of men
n + n log(2π) + n log(rss/n)
There is a discrepancy in R output from the functions step, AIC, and BIC over how to compute the AIC. The discrepancy is not very important, because it involves a difference of a constant factor that cancels
containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics.
Getting Correlations Using PROC CORR Correlation analysis provides a method to measure the strength of a linear relationship between two numeric variables. PROC CORR can be used to compute Pearson product-moment
2. Linear regression with multiple regressors
2. Linear regression with multiple regressors Aim of this section: Introduction of the multiple regression model OLS estimation in multiple regression Measures-of-fit in multiple regression Assumptions
Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010
Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@
Developing Risk Adjustment Techniques Using the SAS@ System for Assessing Health Care Quality in the lmsystem@ Yanchun Xu, Andrius Kubilius Joint Commission on Accreditation of Healthcare Organizations,
Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables 2
Lesson 4 Part 1 Relationships between two numerical variables 1 Correlation Coefficient The correlation coefficient is a summary statistic that describes the linear relationship between two numerical variables
Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)
Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume
August 2012 EXAMINATIONS Solution Part I
August 01 EXAMINATIONS Solution Part I (1) In a random sample of 600 eligible voters, the probability that less than 38% will be in favour of this policy is closest to (B) () In a large random sample,
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
Principles of Hypothesis Testing for Public Health
Principles of Hypothesis Testing for Public Health Laura Lee Johnson, Ph.D. Statistician National Center for Complementary and Alternative Medicine johnslau@mail.nih.gov Fall 2011 Answers to Questions
How To Run Statistical Tests in Excel
How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting
2. What is the general linear model to be used to model linear trend? (Write out the model) = + + + or
Simple and Multiple Regression Analysis Example: Explore the relationships among Month, Adv.$ and Sales $: 1. Prepare a scatter plot of these data. The scatter plots for Adv.$ versus Sales, and Month versus
Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013
Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives
Sales forecasting # 1
Sales forecasting # 1 Arthur Charpentier arthur.charpentier@univ-rennes1.fr 1 Agenda Qualitative and quantitative methods, a very general introduction Series decomposition Short versus long term forecasting