# 7. Tests of Association and Linear Regression


In this chapter we consider:

1. Tests of association for two qualitative variables.
2. Measures of the strength of linear association between two quantitative variables.
3. Tests of association for two quantitative variables.
4. Linear regression: predicting one quantitative variable on the basis of another.

## 7.1 Chi-squared test for independence of qualitative variables

The data for these tests are contingency tables showing the relationship between two qualitative variables. For example, suppose we have the following information regarding hair and eye colour:

|            | Red hair | Blonde hair | Dark hair |
|------------|----------|-------------|-----------|
| Blue eyes  | 30       | 90          | 30        |
| Brown eyes | 20       | 60          | 70        |

i.e. 30 people have red hair and blue eyes.

### Test for independence

Let n_{i,j} be the entry in the i-th row and j-th column of the contingency table. We wish to choose between the hypotheses

H_0: hair colour and eye colour are independent.
H_A: hair colour and eye colour are dependent.

### Row and column sums

The number of people in the sample with blue eyes is the sum of the entries in the first row (150). The number of people in the sample with brown eyes is the sum of the entries in the second row (150). The sum of all the entries is the number of individuals in the sample (300).

### Probabilities of given events

The probability that an individual in the sample has blue eyes, P(blue), is the number of people with blue eyes divided by the number of people in the sample. Similarly for brown eyes, i.e.

P(blue) = 150/300 = 1/2,  P(brown) = 150/300 = 1/2.

### Probabilities of given events

The number of individuals with red hair is the sum of the entries in the first column (50). Arguing as above, the probability that an individual in the sample has red hair is P(red) = 50/300 = 1/6. In a similar way,

P(blond) = 150/300 = 1/2,  P(dark) = 100/300 = 1/3.

### Probabilities under the hypothesis of independence

If the traits are independent, then the probability that an individual has a given hair colour and a given eye colour is the product of the two corresponding probabilities, e.g.

P(blond hair, blue eyes) = P(blond hair) P(blue eyes).

In order to test whether two traits are independent, we need to calculate what we would expect to observe if the traits were independent. The following calculations give the expected counts under the null hypothesis of independence.

### Probabilities under the hypothesis of independence

In general, let n_{i,.} be the sum of the entries in the i-th row and n_{.,j} be the sum of the entries in the j-th column. Let n be the total number of observations (the sum of the entries in all the cells). Hence, here n_{1,.} = n_{2,.} = 150 (the sums of the entries in the first and second rows, respectively). Also, n_{.,1} = 50, n_{.,2} = 150, n_{.,3} = 100 (the sums of the entries in columns 1, 2 and 3, respectively), and n = 300 (there are 300 observations in total).

### Probabilities under the hypothesis of independence

The probability of an entry being in the i-th row is p_{i,.} = n_{i,.}/n. The probability of an entry being in the j-th column is p_{.,j} = n_{.,j}/n. If the two traits considered are independent, then the probability of an entry being in the cell in the i-th row and j-th column is p_{i,j}, where

p_{i,j} = p_{i,.} p_{.,j} = (n_{i,.}/n)(n_{.,j}/n) = n_{i,.} n_{.,j} / n^2.

### Expected number of observations in a cell under H_0

Under the null hypothesis, we expect e_{i,j} observations in cell (i, j), where

e_{i,j} = n p_{i,j} = n_{i,.} n_{.,j} / n,

i.e. the expected number is the row sum times the column sum divided by the total number of observations.

### Table of expected values

In the example considered the calculation is as follows:

|            | Red hair           | Blond hair          | Dark hair           |
|------------|--------------------|---------------------|---------------------|
| Blue eyes  | 150 x 50/300 = 25  | 150 x 150/300 = 75  | 150 x 100/300 = 50  |
| Brown eyes | 150 x 50/300 = 25  | 150 x 150/300 = 75  | 150 x 100/300 = 50  |

### Comparison of observations and expectations

Expectations are given in brackets:

|            | Red hair | Blond hair | Dark hair | Total |
|------------|----------|------------|-----------|-------|
| Blue eyes  | 30 (25)  | 90 (75)    | 30 (50)   | 150   |
| Brown eyes | 20 (25)  | 60 (75)    | 70 (50)   | 150   |

It should be noted that the sum of the expectations in a row is equal to the sum of the observations. An analogous result holds for the columns.

### The test statistic

The test statistic is

T = sum over all cells (i,j) of (n_{i,j} - e_{i,j})^2 / e_{i,j},

where the summation is carried out over all the cells of the contingency table. The realisation of this statistic is labelled t. This is a measure of the distance of our observations from those we expect under H_0. If the null hypothesis is true, then n_{i,j} and e_{i,j} are likely to be similar, hence the realisation t will tend to be close to 0 (by definition this realisation is non-negative). Large values of t indicate that the traits are dependent.

### Calculation of the realisation of the test statistic

In this case,

t = (30-25)^2/25 + (90-75)^2/75 + (30-50)^2/50 + (20-25)^2/25 + (60-75)^2/75 + (70-50)^2/50 = 1 + 3 + 8 + 1 + 3 + 8 = 24.
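The whole calculation above can be sketched in a few lines of Python using only NumPy (the table values are those of the hair/eye colour example; this is an illustrative sketch, not part of the course software):

```python
import numpy as np

# Observed contingency table from the example:
# rows = eye colour (blue, brown), columns = hair colour (red, blonde, dark)
observed = np.array([[30.0, 90.0, 30.0],
                     [20.0, 60.0, 70.0]])

row_sums = observed.sum(axis=1, keepdims=True)   # n_{i,.}
col_sums = observed.sum(axis=0, keepdims=True)   # n_{.,j}
n = observed.sum()                               # total sample size

# Expected counts under independence: e_{i,j} = row sum * column sum / n
expected = row_sums @ col_sums / n

# Test statistic t = sum of (n_{i,j} - e_{i,j})^2 / e_{i,j} over all cells
t = ((observed - expected) ** 2 / expected).sum()
print(t)  # 24.0
```

The outer product of the row and column sums divided by n reproduces the table of expected values (25, 75, 50 in each row) in one step.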

### Distribution of the test statistic under H_0

Given that the traits are independent, the test statistic has an approximate chi-squared distribution with (r - 1)(c - 1) degrees of freedom, where r and c are the numbers of rows and columns, respectively. This approximation is reasonable if at least 5 observations are expected in each cell under H_0.

### Making conclusions

Since large values of t indicate that the traits are dependent, we reject the null hypothesis of independence at a significance level of alpha if t > chi^2_{(r-1)(c-1),alpha}, where chi^2_{(r-1)(c-1),alpha} is the critical value for the appropriate chi-squared distribution. These values can be read from Table 8 in the Murdoch and Barnes book.

### Making conclusions

In this case, at a significance level of 0.1%,

chi^2_{(r-1)(c-1),alpha} = chi^2_{2,0.001} = 13.816.

Since t = 24 > chi^2_{2,0.001} = 13.816, we reject the null hypothesis of independence. Since we rejected at a significance level of 0.1%, we have very strong evidence that eye and hair colour are dependent.

### Describing the nature of an association

In order to see the nature of the association between hair and eye colour, we should compare the observed and expected values. Comparing the table of observed and expected values, it can be seen that dark hair is associated with brown eyes and blond hair with blue eyes (in both cases there are more observations than expected under the null hypothesis).

### Simpson's paradox

The nature of an association between two traits may change, and even reverse direction, when the data from several groups are combined into a single group. For example, consider the following data regarding admissions to a university. The Physics department accepted 60 applications from 100 males and 20 applications from 30 females. The Fine Arts department accepted 10 applications from 100 males and 30 applications from 170 females.

### Simpson's paradox

Hence, the Physics department accepted 60% of applications from males and 66.7% of applications from females. The Fine Arts department accepted 10% of applications from males and 17.6% of applications from females. Hence, in both departments females are slightly more successful.

### Simpson's paradox

Combining the departments, we obtain the following contingency table:

|        | Accepted | Rejected | Total |
|--------|----------|----------|-------|
| Male   | 70       | 130      | 200   |
| Female | 50       | 150      | 200   |

Combining the results in this way, females are less successful. We now test the null hypothesis that acceptance is independent of sex. The alternative is that acceptance is associated with sex (i.e. the likelihood of acceptance depends on sex).

### Simpson's paradox

The table of expectations is as follows:

|        | Accepted           | Rejected            |
|--------|--------------------|---------------------|
| Male   | 200 x 120/400 = 60 | 200 x 280/400 = 140 |
| Female | 200 x 120/400 = 60 | 200 x 280/400 = 140 |

Comparing the tables of observations and expectations (expectations in brackets):

|        | Accepted | Rejected  | Total |
|--------|----------|-----------|-------|
| Male   | 70 (60)  | 130 (140) | 200   |
| Female | 50 (60)  | 150 (140) | 200   |

### Simpson's paradox

The test statistic is T = sum over all cells (i,j) of (n_{i,j} - e_{i,j})^2 / e_{i,j}. The realisation is given by

t = (70-60)^2/60 + (130-140)^2/140 + (50-60)^2/60 + (150-140)^2/140 = 4.76 (to 2 decimal places).

### Simpson's paradox

Testing the null hypothesis at the 5% level, the critical value is chi^2_{(r-1)(c-1),alpha} = chi^2_{1,0.05} = 3.841. Since t = 4.76 > chi^2_{1,0.05} = 3.841, we reject the null hypothesis. We conclude that acceptance is associated with sex. Looking at the table which compares the observed and expected values, females are less likely to be accepted than males. Hence, using this data we might conclude that there is evidence that females are discriminated against.

### Simpson's paradox

However, in both departments a female is more likely to be accepted. On average, females are less likely to be accepted because they are more likely to apply to the Fine Arts department, and the Fine Arts department rejects a higher proportion of applications.
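The reversal can be verified directly with the admissions numbers from the example. The short sketch below (plain Python, purely illustrative) computes the acceptance rates within each department and after pooling:

```python
# Acceptance data from the example: (accepted, applied) per department and sex.
physics = {"male": (60, 100), "female": (20, 30)}
fine_arts = {"male": (10, 100), "female": (30, 170)}

def rate(accepted, applied):
    """Acceptance rate as a proportion."""
    return accepted / applied

# Within each department, females have the higher acceptance rate...
print(rate(*physics["male"]), rate(*physics["female"]))      # 0.60 vs about 0.667
print(rate(*fine_arts["male"]), rate(*fine_arts["female"]))  # 0.10 vs about 0.176

# ...but pooling the two departments reverses the direction of the association.
male_acc = physics["male"][0] + fine_arts["male"][0]          # 70 accepted
male_all = physics["male"][1] + fine_arts["male"][1]          # out of 200
female_acc = physics["female"][0] + fine_arts["female"][0]    # 50 accepted
female_all = physics["female"][1] + fine_arts["female"][1]    # out of 200
print(rate(male_acc, male_all), rate(female_acc, female_all))  # 0.35 vs 0.25
```

The pooled rates (35% for males, 25% for females) point in the opposite direction to the within-department rates, which is exactly Simpson's paradox.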

### Lurking variables

In this case, when we pool the data, department is what we call a hidden or lurking variable. A hidden or lurking variable is a variable which may influence the observed results, but is not considered in the analysis. For example, women's salaries are significantly lower on average than men's salaries.

1. Can this be used as evidence that women are discriminated against?
2. What are the lurking variables which may explain such a difference?
3. How should we test whether females are discriminated against with regard to salary?

## 7.2 Regression and Prediction

Suppose we have n pairs of observations of quantitative variables (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n), e.g. X_i and Y_i may represent the height and weight of the i-th person in a sample, respectively.

### Scatter plots

The relationship between these two variables may be illustrated graphically using a scatter plot.

### Types of relationship

We distinguish between three types of relationship:

1. Linear relationship.
2. Non-linear (curvilinear) relationship. In this case, exponential growth and seasonal variation will be important types of relationship.
3. No relationship (independence).

These relationships are illustrated on the following slides.

### A linear relationship

(Figure: scatter plot illustrating a linear relationship.)

### A non-linear relationship

(Figure: scatter plot illustrating a non-linear relationship.)

### Independent traits

(Figure: scatter plot illustrating independent traits.)

### Does association imply cause and effect?

It should be noted that a relationship (association) between two variables does not necessarily mean that there is a cause-effect mechanism at work between the variables. Also, correlation and regression analysis can only detect associations; they cannot determine what is the cause and what is the effect. For example, there is a negative correlation between flu infections and the sales of ice cream. Of course, this does not mean that ice cream gives protection against flu infection. This correlation arises because as the temperature increases, ice cream sales increase and the number of flu infections decreases.

### Cause-effect diagrams

The relationship considered may be illustrated by the following cause-effect diagram:

Temperature -> Ice Cream Sales
Temperature -> Flu Infections

i.e. the weather affects both ice cream sales and the number of flu infections.

### Criteria for establishing causation

1. An association should be found in numerous studies which consider various factors (this eliminates the possibility that a lurking variable is the cause).
2. The cause must precede the effect.
3. Established physical laws, e.g. a force causes an acceleration; if I heat a gas, its pressure will increase (this is related to 1).

### Outliers

A scatter plot may indicate that certain observations are outliers. Such observations may result from a mistake in recording the observations (see diagram).

### Outliers

Analysis should be carried out both with and without outliers (this indicates how robust the analysis is to possible mistakes).

### 7.2.1 Correlation coefficients

A coefficient of the correlation between two variables X and Y, denoted r(X, Y), is a measure of the strength of the relationship between X and Y. Both X and Y should be quantitative variables.

### Properties of a correlation coefficient

1. -1 <= r(X, Y) <= 1.
2. If r(X, Y) = 0, then there is no general tendency for Y to increase as X increases (this does not mean that there is no relationship between X and Y).
3. A correlation coefficient is independent of the units of measurement.

### Properties of a correlation coefficient

4. If r(X, Y) > 0, then there is a tendency for Y to increase when X increases.
5. If r(X, Y) < 0, then there is a tendency for Y to decrease as X increases.
6. Values of r(X, Y) close to 1 or -1 indicate a strong relationship between X and Y (note that values close to 0 do not necessarily indicate a weak relationship between X and Y).

### Estimation of the population correlation coefficient

In practice, we estimate the population correlation coefficient based on a sample using either Pearson's or Spearman's correlation coefficient. This sample correlation coefficient will have some distribution centred around the appropriate population (i.e. true) correlation coefficient.

### Use of Pearson's correlation coefficient

We should use Pearson's correlation coefficient when:

1. The distributions of both variables are close to normal (we can check this by looking at the histograms of both variables).
2. The relationship between the variables can be assumed to be linear (we can check this by looking at a scatter plot for the pair of variables).

Otherwise, we should use Spearman's correlation coefficient. r_P(X, Y) and r_S(X, Y) will be used to denote Pearson's and Spearman's correlation coefficients between X and Y, respectively.

### An additional property of Pearson's correlation coefficient

In addition to the 6 general properties listed above, Pearson's correlation coefficient has the following property. If |r_P(X, Y)| = 1, then the relationship between X and Y is an exact linear relationship, i.e. Y = a + bX. In this case, if r_P(X, Y) = 1, then b > 0 (i.e. a positive relationship), and if r_P(X, Y) = -1, then b < 0 (i.e. a negative relationship). Pearson's correlation coefficient should not be used to measure the strength of an obviously non-linear relationship.

### Properties of Spearman's correlation coefficient

In addition to the 6 general properties listed above, Spearman's correlation coefficient has the following properties.

1. Let R_i and S_i be the ranks of the observations X_i and Y_i, respectively (i.e. R_i = k if X_i is the k-th smallest of the X observations). Spearman's correlation coefficient is defined to be Pearson's correlation coefficient between the ranks R and S.
2. If r_S(X, Y) = 1, then there is a monotonically increasing relationship between X and Y, and if r_S(X, Y) = -1, then there is a monotonically decreasing relationship between X and Y.

The calculation of these correlation coefficients is numerically intensive and hence they will only be calculated using the SPSS package.
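The notes compute these coefficients in SPSS, but the definitions above are easy to sketch directly. The following illustrative Python code (made-up data, ties ignored for simplicity) implements Pearson's coefficient and then obtains Spearman's coefficient as Pearson's coefficient of the ranks, exactly as in property 1:

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation coefficient of two samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xd, yd = x - x.mean(), y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

def ranks(x):
    """Rank of each observation (1 = smallest); ties are ignored here."""
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman(x, y):
    """Spearman = Pearson applied to the ranks R_i, S_i."""
    return pearson(ranks(x), ranks(y))

# A monotonic but non-linear relationship, e.g. y = x^3:
x = np.arange(1.0, 9.0)
y = x ** 3
print(spearman(x, y))  # 1.0: the relationship is monotonically increasing
print(pearson(x, y))   # close to, but below, 1 (about 0.94)
```

This also illustrates the point made in Example 7.1 below: for a positive monotonic relationship Spearman's coefficient equals 1 exactly, while Pearson's coefficient is merely close to 1.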

### A test of association based on the correlation coefficient

We can test the null hypothesis

H_0: r(X, Y) = 0

against

H_A: r(X, Y) != 0.

H_0 is rejected if the correlation coefficient used is significantly different from 0. We only carry out these tests with the aid of SPSS. The test carried out corresponds to the correlation coefficient used.

### Example 7.1

The following graph illustrates the distance travelled (in metres) by a ball rolling down a hill as a function of time (in seconds).

### Example 7.1

The relationship is clearly non-linear, and hence the appropriate measure of correlation is Spearman's. From the Correlate menu in SPSS, we highlight the option to calculate Spearman's correlation coefficient (Pearson's correlation coefficient is calculated by default). The results are illustrated on the next slide.

### Example 7.1

(Figure: SPSS output showing the correlation coefficients.)

### Example 7.1

Note that the correlation coefficient between any variable and itself is by definition equal to one. Spearman's correlation coefficient between time and distance is equal to 1. This is due to the fact that as time increases, the distance always increases, i.e. the relationship is monotonic and positive. Note that Pearson's correlation coefficient is close to one (0.976). Any positive, monotonic relationship will give a large positive value for Pearson's correlation coefficient. Hence, a Pearson's correlation coefficient close to, but not equal to, 1 (or -1) is not evidence of a linear relationship.

## 7.3 Linear Regression

Linear regression is used to predict the value of a response variable Y given the value of a predictor variable X. It should be noted that the association between X and Y does not need to be a cause-effect relationship, but if there is such a relationship, X should be defined to be the cause.

### Linear regression

Linear regression assumes that the real relationship between the response variable y and the predictor variable x is of the form

y_i = beta_0 + beta_1 x_i + epsilon_i.

beta_0 + beta_1 x_i is called the linear component of the model and epsilon_i is the random component. This will be called the regression model. It is assumed that the random components are normally distributed with mean 0 (and that this distribution does not depend on x).

### Linear regression

The real (population) regression line is given by Y = beta_0 + beta_1 X. This gives the average value of the response variable given the value of the predictor variable, i.e. when X = k the average value of the response variable is beta_0 + beta_1 k. The response variable will in general show variation around this mean value.

### Interpretation of the parameter beta_1

beta_1 is the slope of the real regression line. A positive slope indicates that the response variable increases as the predictor variable increases, e.g. if beta_1 = 2, a unit increase in the predictor variable causes on average an increase of 2 units in the response variable (note that beta_0 and beta_1 are sensitive to the units used). A negative slope indicates that the response variable decreases as the predictor variable increases, e.g. if beta_1 = -5, a unit increase in the predictor variable causes on average a decrease of 5 units in the response variable.

### Interpretation of the parameter beta_0

beta_0 is called the intercept [the graph of the real regression line crosses the y-axis at (0, beta_0)]. If the predictor variable x can take the value 0, then when x = 0, on average y = beta_0.

### Estimation of the regression line

In practice, we estimate the parameters beta_0 and beta_1 by b_0 and b_1, respectively. The estimate of the regression line is Y-hat = b_0 + b_1 X. Given that the predictor variable takes the value x_i, if we did not know the value of the response variable, we would estimate it to be y-hat_i = b_0 + b_1 x_i.

### Estimation of the regression line

It should be noted that this estimate is most accurate for values of x which are close to the mean of the observed values of x. Such estimation should not be used for a value of x outside the range observed.

### Errors in estimation

The difference between an observation of the response variable and its estimate using the regression line is called the residual, denoted r_i, i.e.

r_i = y_i - y-hat_i,

where y_i is the i-th observation of the response variable and y-hat_i is its estimate based on the regression line. If r_i > 0, then use of the regression line results in underestimation of y_i. If r_i < 0, then use of the regression line results in overestimation of y_i.

### 7.3.1 Estimation of the regression parameters

The parameters beta_0 and beta_1 are estimated by b_0 and b_1, chosen such that:

1. The sum of r_i^2 over i = 1, ..., n is minimised (i.e. the mean squared error in estimating the observations of the response variable is minimised).
2. The sum of r_i over i = 1, ..., n equals 0 (i.e. given that the model is correct, there is no tendency to under- or overestimate the observations of the response variable).
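The least-squares criterion above has a closed-form solution: the slope estimate is the ratio of the sum of cross-deviations to the sum of squared x-deviations, and the intercept follows from the sample means. A short sketch (the height/weight numbers below are made up for illustration, not the lab1.sav data used later):

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates: b1 = S_xy / S_xx, b0 = ybar - b1 * xbar."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b1 = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

x = np.array([160.0, 165.0, 170.0, 175.0, 180.0])  # hypothetical heights (cm)
y = np.array([55.0, 62.0, 66.0, 69.0, 75.0])       # hypothetical weights (kg)
b0, b1 = fit_line(x, y)
residuals = y - (b0 + b1 * x)
print(b1)                # slope, about 0.94 kg per cm for these numbers
print(residuals.sum())   # essentially zero: least squares forces sum(r_i) = 0
```

Property 2 (residuals summing to zero) is not an extra condition but a consequence of minimising the sum of squared residuals when an intercept is included in the model.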

### Test of association based on regression analysis

Given our estimate of the slope of the regression line, we can carry out a test of the hypothesis

H_0: beta_1 = 0

against

H_A: beta_1 != 0.

In words, H_0 may be interpreted as the hypothesis that there is no linear relation between X and Y (as X varies, Y does not on average vary). H_A may be interpreted as the hypothesis that there is a relation between X and Y such that as X rises, the mean value of Y either systematically rises or systematically falls. H_0 is not equivalent to the hypothesis that there is no relationship between X and Y (see the diagram on the next slide).

### A non-linear relationship with a non-significant regression slope

(Figure: scatter plot of a non-linear relationship for which the fitted regression slope is not significantly different from zero.)

### Interpretation of the results of a test of association

As in the interpretation of a correlation coefficient, a significant relationship between two variables does not necessarily imply a cause-effect relationship. A lurking variable may be the cause of such a relationship. For example, weekly sales of ice cream will be negatively correlated with the number of flu infections, since hot weather leads to increased ice cream sales but fewer flu infections. We only consider such tests in SPSS.

### Example 7.2

We consider the data on height X and weight before studies Y given in lab1.sav. A scatter plot indicates that there is a clear positive relationship between height and weight (as height increases, on average weight increases). From the scatter plot it seems that the assumption of a linear relationship between X and Y is reasonable. The histograms for height and weight show that the distributions of both of these variables are similar to the normal distribution. It follows that Pearson's correlation coefficient will be a good measure of the strength of the relationship between height and weight.

### Scatter plot of the relationship between height and weight

(Figure: scatter plot of weight against height.)

### Histogram of height

(Figure: histogram of height.)

### Histogram of weight

(Figure: histogram of weight.)

### Example 7.2

There do not seem to be any significant outliers. The observation in the top right corner is reasonably close to the straight line which best fits the data; an observation in the top left or bottom right corner would be an outlier. The Pearson correlation coefficient is large and positive, indicating a strong positive relationship between the two variables (i.e. as height increases, weight on average increases).

### Interpretation of the results from SPSS

SPSS gives the regression coefficients, together with their standard errors, t statistics and significance (sig.) values, for the relationship of height (in cm) with weight (in kg). The estimated regression model, with coefficients read from the B column, is

Weight = -48.97 + 0.681 x Height,

i.e. b_0 = -48.97 and b_1 = 0.681.

### Interpretation of the results from SPSS

Weight = -48.97 + 0.681 x Height,

i.e. an increase in height of 1 cm leads to an average increase in weight of 0.681 kg. The sig. value in the bottom right-hand corner of the output (0.000) gives the p-value for the test of H_0: beta_1 = 0 against H_A: beta_1 != 0.

### Interpretation of the results from SPSS

H_0 states that there is no linear relationship between height and weight. Since p = 0.000 < 0.001, we reject H_0 at a significance level of 0.1%. We have very strong evidence that there is a relationship between height and weight. Since b_1 > 0, this indicates a positive relationship between height and weight (i.e. as height increases, weight increases).

### Interpretation of the results from SPSS

We can use the estimated regression line to estimate the average weight for a given height, e.g. the average weight of people of height 170 cm can be approximated by substituting X = 170 into the regression equation:

Y-hat = -48.97 + 0.681 x 170 = -48.97 + 115.77 = 66.8 kg.

This is clearly a reasonable approximation. Note that 170 cm is very close to the average height.

### Interpretation of the results from SPSS

Using this regression line to estimate the average weight of individuals 50 cm tall:

Y-hat = -48.97 + 0.681 x 50 = -48.97 + 34.05 = -14.9 kg.

This is clearly a nonsensical estimate, because 50 cm is well outside the range of heights we observe. In such cases, approximation using the regression line is inappropriate. Note that although the relationship may be reasonably linear over the range observed, it may not be linear over a larger range. If the relationship is clearly non-linear, such approximation will always be inappropriate.
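The two substitutions above can be checked directly. A tiny sketch evaluating the fitted line (the coefficients are those reported by SPSS in the example; the function name is my own):

```python
# Fitted regression line from the SPSS output: Weight = -48.97 + 0.681 * Height
def predicted_weight(height_cm):
    return -48.97 + 0.681 * height_cm

# Inside the observed range of heights the estimate is sensible...
print(round(predicted_weight(170), 1))  # 66.8 kg

# ...but far outside it the same formula produces nonsense.
print(round(predicted_weight(50), 2))   # -14.92 kg, a negative weight
```

The negative prediction at 50 cm makes the extrapolation warning concrete: the formula is only trustworthy over (roughly) the range of x values used to fit it.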

## 7.4 Transforming data in order to obtain a linear relationship: exponential growth

Assume that there is a non-linear relationship between X and Y. It may be possible to find a function U = g(Y) such that there is a linear relationship between U and X. In this case, we can carry out linear regression of U on X, obtaining a model U = g(Y) = a + bX. Transforming this equation, we can find the regression equation for Y as some non-linear function of X.

### Exponential growth

Exponential growth is a very important relationship in the financial sector, e.g. for the value of investments and for prices. We can model the value of an investment, Y, as an exponential function of time, X. Such a model states that if the value of the investment at time 0 is beta_0, then its expected value at time X is of the form

Y = beta_0 e^(beta_1 X),

where beta_1 is the expected interest rate on the investment. Taking logarithms, we obtain

ln Y = ln(beta_0) + beta_1 X.

Hence, there is a linear relationship between X and U = ln Y.
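The log-transform-then-regress recipe can be sketched end to end. The data below are synthetic (generated to mimic growth of about 2.1% per time unit, similar in spirit to the GDP example that follows, but not the actual GDP figures):

```python
import numpy as np

# Synthetic exponential growth with a little multiplicative noise.
rng = np.random.default_rng(0)
x = np.arange(0.0, 30.0)                       # time, in years
y = np.exp(0.021 * x + rng.normal(0.0, 0.01, x.size))

# Step 1: transform so that the relationship is linear: u = ln(y).
u = np.log(y)

# Step 2: ordinary least squares for u = ln(beta_0) + beta_1 * x.
b1 = ((x - x.mean()) * (u - u.mean())).sum() / ((x - x.mean()) ** 2).sum()
ln_b0 = u.mean() - b1 * x.mean()

# Step 3: back-transform to the exponential model y = b0 * exp(b1 * x).
print(np.exp(ln_b0), b1)   # both close to the true values 1.0 and 0.021
print(np.exp(b1) - 1)      # estimated growth rate per time unit, about 2.1%
```

The back-transformed intercept exp(ln_b0) plays the role of beta_0 (the value at time 0), and exp(b1) - 1 is the estimated per-period growth rate, just as in the GDP example below.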

### Example: Exponential Regression

The nominal annual GDP of Ireland relative to GDP in 1950 (year 0) is given for a number of years. Derive an appropriate regression equation for GDP as a function of time in years. The graph on the next slide shows a scatter plot of GDP (y-axis) as a function of time (x-axis).

### Example: Exponential Regression

(Figure: scatter plot of relative GDP against time.)

### Example: Exponential Regression

It can be seen that the relationship between GDP and time is not linear (the curve bends upwards). This suggests that the relationship between time and GDP is exponential (nominal monetary values often show exponential growth). For this reason, we take the logarithm of the dependent variable (GDP). This gives us the table on the next slide.

### Example: Exponential Regression

The table gives ln(GDP) for each year. The graph on the next slide shows a scatter plot of ln(GDP) (y-axis) as a function of time (x-axis).

### Example: Exponential Regression

(Figure: scatter plot of ln(GDP) against time.)

### Example: Exponential Regression

This scatter plot shows that it is reasonable to assume that the relationship between time and ln(GDP) is linear. Hence, we now carry out linear regression to find an expression for ln(GDP) as a function of time. The results of the analysis from SPSS are given on the next slide.

### Example: Exponential Regression

(Figure: SPSS regression output for ln(GDP) on time.)

### Example: Exponential Regression

The model for ln(GDP) as a function of time is

ln(GDP) = -0.006 + 0.021 x time.

Since we want a model for GDP, we invert the operation of taking logarithms by exponentiating. It follows that

GDP = exp(-0.006 + 0.021 x time) = exp(-0.006) exp(0.021 x time) = 0.994 exp(0.021 x time).

### Example: Exponential Regression

Using this equation, we can estimate relative GDP in 2010 (year 60):

GDP = 0.994 exp(0.021 x 60) = 3.50 (to 2 decimal places).

Each year GDP increases by a factor of exp(beta_1). Thus, our estimate of the annual growth rate per unit time (here a year) is exp(beta_1) - 1. Hence, the average growth rate per annum is estimated to be e^0.021 - 1 = 0.0212, i.e. 2.12% per annum.
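These two numbers follow directly from the fitted coefficients. A quick check in Python (using only the coefficients 0.994 and 0.021 from the fitted model above):

```python
import math

# Fitted model from the example: GDP = 0.994 * exp(0.021 * time), time in years
gdp_2010 = 0.994 * math.exp(0.021 * 60)   # relative GDP in year 60 (2010)
growth_rate = math.exp(0.021) - 1         # per-annum growth factor minus 1

print(round(gdp_2010, 2))     # about 3.50 times the 1950 level
print(round(growth_rate, 4))  # about 0.0212, i.e. 2.12% per annum
```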

### Example: Exponential Regression

Note that since GDP is given relative to GDP in 1950 (year 0), ln(GDP) in year 0 is by definition zero. In such cases, we can leave out the intercept (constant) in the regression equation for ln(GDP). In SPSS this can be done by clicking on the Options menu in the regression window and unchecking the option "Include constant in the equation".

### Example: Exponential Regression

(Figure: SPSS regression output for ln(GDP) on time without an intercept.)

### Example: Exponential Regression

The model obtained in this way is ln(GDP) = 0.021 x time. Exponentiating, we obtain GDP = exp(0.021 x time). The estimate of the relative GDP in 2010 (year 60) is given by

GDP = exp(0.021 x 60) = 3.53 (to 2 decimal places).


### Technology Step-by-Step Using StatCrunch

Technology Step-by-Step Using StatCrunch Section 1.3 Simple Random Sampling 1. Select Data, highlight Simulate Data, then highlight Discrete Uniform. 2. Fill in the following window with the appropriate

### The Dummy s Guide to Data Analysis Using SPSS

The Dummy s Guide to Data Analysis Using SPSS Mathematics 57 Scripps College Amy Gamble April, 2001 Amy Gamble 4/30/01 All Rights Rerserved TABLE OF CONTENTS PAGE Helpful Hints for All Tests...1 Tests

### Chapter 14: Analyzing Relationships Between Variables

Chapter Outlines for: Frey, L., Botan, C., & Kreps, G. (1999). Investigating communication: An introduction to research methods. (2nd ed.) Boston: Allyn & Bacon. Chapter 14: Analyzing Relationships Between

### Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

### Linear Regression. Chapter 5. Prediction via Regression Line Number of new birds and Percent returning. Least Squares

Linear Regression Chapter 5 Regression Objective: To quantify the linear relationship between an explanatory variable (x) and response variable (y). We can then predict the average response for all subjects

### Simple Predictive Analytics Curtis Seare

Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

### Chapter 10 - Practice Problems 1

Chapter 10 - Practice Problems 1 1. A researcher is interested in determining if one could predict the score on a statistics exam from the amount of time spent studying for the exam. In this study, the

### 1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

### Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:

Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:

### REGRESSION LINES IN STATA

REGRESSION LINES IN STATA THOMAS ELLIOTT 1. Introduction to Regression Regression analysis is about eploring linear relationships between a dependent variable and one or more independent variables. Regression

### 11. Analysis of Case-control Studies Logistic Regression

Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

### Association Between Variables

Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

### Sydney Roberts Predicting Age Group Swimmers 50 Freestyle Time 1. 1. Introduction p. 2. 2. Statistical Methods Used p. 5. 3. 10 and under Males p.

Sydney Roberts Predicting Age Group Swimmers 50 Freestyle Time 1 Table of Contents 1. Introduction p. 2 2. Statistical Methods Used p. 5 3. 10 and under Males p. 8 4. 11 and up Males p. 10 5. 10 and under

### Mind on Statistics. Chapter 3

Mind on Statistics Chapter 3 Section 3.1 1. Which one of the following is not appropriate for studying the relationship between two quantitative variables? A. Scatterplot B. Bar chart C. Correlation D.

### Course Objective This course is designed to give you a basic understanding of how to run regressions in SPSS.

SPSS Regressions Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Email. SSRL@American.edu Course Objective This course is designed

### CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the

### Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship

### Bivariate Analysis. Correlation. Correlation. Pearson's Correlation Coefficient. Variable 1. Variable 2

Bivariate Analysis Variable 2 LEVELS >2 LEVELS COTIUOUS Correlation Used when you measure two continuous variables. Variable 2 2 LEVELS X 2 >2 LEVELS X 2 COTIUOUS t-test X 2 X 2 AOVA (F-test) t-test AOVA

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Module 5: Multiple Regression Analysis

Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

### Module 9: Nonparametric Tests. The Applied Research Center

Module 9: Nonparametric Tests The Applied Research Center Module 9 Overview } Nonparametric Tests } Parametric vs. Nonparametric Tests } Restrictions of Nonparametric Tests } One-Sample Chi-Square Test

### SPSS TUTORIAL & EXERCISE BOOK

UNIVERSITY OF MISKOLC Faculty of Economics Institute of Business Information and Methods Department of Business Statistics and Economic Forecasting PETRA PETROVICS SPSS TUTORIAL & EXERCISE BOOK FOR BUSINESS

### DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9

DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,

### IBM SPSS Statistics for Beginners for Windows

ISS, NEWCASTLE UNIVERSITY IBM SPSS Statistics for Beginners for Windows A Training Manual for Beginners Dr. S. T. Kometa A Training Manual for Beginners Contents 1 Aims and Objectives... 3 1.1 Learning

### Exercise 1.12 (Pg. 22-23)

Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

### Directions for using SPSS

Directions for using SPSS Table of Contents Connecting and Working with Files 1. Accessing SPSS... 2 2. Transferring Files to N:\drive or your computer... 3 3. Importing Data from Another File Format...

### Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

### Example: Boats and Manatees

Figure 9-6 Example: Boats and Manatees Slide 1 Given the sample data in Table 9-1, find the value of the linear correlation coefficient r, then refer to Table A-6 to determine whether there is a significant

### Appendix E: Graphing Data

You will often make scatter diagrams and line graphs to illustrate the data that you collect. Scatter diagrams are often used to show the relationship between two variables. For example, in an absorbance

### Outline of Topics. Statistical Methods I. Types of Data. Descriptive Statistics

Statistical Methods I Tamekia L. Jones, Ph.D. (tjones@cog.ufl.edu) Research Assistant Professor Children s Oncology Group Statistics & Data Center Department of Biostatistics Colleges of Medicine and Public

### Chapter 23. Inferences for Regression

Chapter 23. Inferences for Regression Topics covered in this chapter: Simple Linear Regression Simple Linear Regression Example 23.1: Crying and IQ The Problem: Infants who cry easily may be more easily

### Yiming Peng, Department of Statistics. February 12, 2013

Regression Analysis Using JMP Yiming Peng, Department of Statistics February 12, 2013 2 Presentation and Data http://www.lisa.stat.vt.edu Short Courses Regression Analysis Using JMP Download Data to Desktop

### San Jose State University Engineering 10 1

KY San Jose State University Engineering 10 1 Select Insert from the main menu Plotting in Excel Select All Chart Types San Jose State University Engineering 10 2 Definition: A chart that consists of multiple

### Interaction between quantitative predictors

Interaction between quantitative predictors In a first-order model like the ones we have discussed, the association between E(y) and a predictor x j does not depend on the value of the other predictors

### II. DISTRIBUTIONS distribution normal distribution. standard scores

Appendix D Basic Measurement And Statistics The following information was developed by Steven Rothke, PhD, Department of Psychology, Rehabilitation Institute of Chicago (RIC) and expanded by Mary F. Schmidt,

### Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

### Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

### Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the

### Simple Regression Theory II 2010 Samuel L. Baker

SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

### The Correlation Coefficient

The Correlation Coefficient Lelys Bravo de Guenni April 22nd, 2015 Outline The Correlation coefficient Positive Correlation Negative Correlation Properties of the Correlation Coefficient Non-linear association

### Regression Analysis: A Complete Example

Regression Analysis: A Complete Example This section works out an example that includes all the topics we have discussed so far in this chapter. A complete example of regression analysis. PhotoDisc, Inc./Getty

### A Guide for a Selection of SPSS Functions

A Guide for a Selection of SPSS Functions IBM SPSS Statistics 19 Compiled by Beth Gaedy, Math Specialist, Viterbo University - 2012 Using documents prepared by Drs. Sheldon Lee, Marcus Saegrove, Jennifer

### MTH 140 Statistics Videos

MTH 140 Statistics Videos Chapter 1 Picturing Distributions with Graphs Individuals and Variables Categorical Variables: Pie Charts and Bar Graphs Categorical Variables: Pie Charts and Bar Graphs Quantitative

### Numerical Methods Lecture 5 - Curve Fitting Techniques

Numerical Methods Lecture 5 - Curve Fitting Techniques Topics motivation interpolation linear regression higher order polynomial form exponential form Curve fitting - motivation For root finding, we used

Using Your TI-83/84 Calculator: Linear Correlation and Regression Elementary Statistics Dr. Laura Schultz This handout describes how to use your calculator for various linear correlation and regression

### MULTIPLE REGRESSION EXAMPLE

MULTIPLE REGRESSION EXAMPLE For a sample of n = 166 college students, the following variables were measured: Y = height X 1 = mother s height ( momheight ) X 2 = father s height ( dadheight ) X 3 = 1 if

### 2. Simple Linear Regression

Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

### Chapter 9. Section Correlation

Chapter 9 Section 9.1 - Correlation Objectives: Introduce linear correlation, independent and dependent variables, and the types of correlation Find a correlation coefficient Test a population correlation

### The correlation coefficient

The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

### SPSS for Exploratory Data Analysis Data used in this guide: studentp.sav (http://people.ysu.edu/~gchang/stat/studentp.sav)

Data used in this guide: studentp.sav (http://people.ysu.edu/~gchang/stat/studentp.sav) Organize and Display One Quantitative Variable (Descriptive Statistics, Boxplot & Histogram) 1. Move the mouse pointer

Using Your TI-NSpire Calculator: Linear Correlation and Regression Dr. Laura Schultz Statistics I This handout describes how to use your calculator for various linear correlation and regression applications.

### Chapter 5. Regression

Topics covered in this chapter: Chapter 5. Regression Adding a Regression Line to a Scatterplot Regression Lines and Influential Observations Finding the Least Squares Regression Model Adding a Regression

### Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

### Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA

Age to Age Factor Selection under Changing Development Chris G. Gross, ACAS, MAAA Introduction A common question faced by many actuaries when selecting loss development factors is whether to base the selected

### EXCEL EXERCISE AND ACCELERATION DUE TO GRAVITY

EXCEL EXERCISE AND ACCELERATION DUE TO GRAVITY Objective: To learn how to use the Excel spreadsheet to record your data, calculate values and make graphs. To analyze the data from the Acceleration Due

### CHAPTER 11 CHI-SQUARE: NON-PARAMETRIC COMPARISONS OF FREQUENCY

CHAPTER 11 CHI-SQUARE: NON-PARAMETRIC COMPARISONS OF FREQUENCY The hypothesis testing statistics detailed thus far in this text have all been designed to allow comparison of the means of two or more samples

### Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable

### SPSS Guide: Regression Analysis

SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar

### Regression Analysis. Data Calculations Output

Regression Analysis In an attempt to find answers to questions such as those posed above, empirical labour economists use a useful tool called regression analysis. Regression analysis is essentially a

### Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

### Introduction to Stata

Introduction to Stata September 23, 2014 Stata is one of a few statistical analysis programs that social scientists use. Stata is in the mid-range of how easy it is to use. Other options include SPSS,

### ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

DATABASE MARKETING Fall 2015, max 24 credits Dead line 15.10. ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS PART A Gains chart with excel Prepare a gains chart from the data in \\work\courses\e\27\e20100\ass4b.xls.

### Poisson Models for Count Data

Chapter 4 Poisson Models for Count Data In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the

### 4. Joint Distributions of Two Random Variables

4. Joint Distributions of Two Random Variables 4.1 Joint Distributions of Two Discrete Random Variables Suppose the discrete random variables X and Y have supports S X and S Y, respectively. The joint

### Regression analysis in practice with GRETL

Regression analysis in practice with GRETL Prerequisites You will need the GNU econometrics software GRETL installed on your computer (http://gretl.sourceforge.net/), together with the sample files that

### hp calculators HP 50g Trend Lines The STAT menu Trend Lines Practice predicting the future using trend lines

The STAT menu Trend Lines Practice predicting the future using trend lines The STAT menu The Statistics menu is accessed from the ORANGE shifted function of the 5 key by pressing Ù. When pressed, a CHOOSE

### Stats Review Chapters 3-4

Stats Review Chapters 3-4 Created by Teri Johnson Math Coordinator, Mary Stangler Center for Academic Success Examples are taken from Statistics 4 E by Michael Sullivan, III And the corresponding Test

### AP STATISTICS REVIEW (YMS Chapters 1-8)

AP STATISTICS REVIEW (YMS Chapters 1-8) Exploring Data (Chapter 1) Categorical Data nominal scale, names e.g. male/female or eye color or breeds of dogs Quantitative Data rational scale (can +,,, with

### Variables and Data A variable contains data about anything we measure. For example; age or gender of the participants or their score on a test.

The Analysis of Research Data The design of any project will determine what sort of statistical tests you should perform on your data and how successful the data analysis will be. For example if you decide

### Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

### Review of Fundamental Mathematics

Review of Fundamental Mathematics As explained in the Preface and in Chapter 1 of your textbook, managerial economics applies microeconomic theory to business decision making. The decision-making tools

### Numerical Summarization of Data OPRE 6301

Numerical Summarization of Data OPRE 6301 Motivation... In the previous session, we used graphical techniques to describe data. For example: While this histogram provides useful insight, other interesting

### Using SPSS, Chapter 2: Descriptive Statistics

1 Using SPSS, Chapter 2: Descriptive Statistics Chapters 2.1 & 2.2 Descriptive Statistics 2 Mean, Standard Deviation, Variance, Range, Minimum, Maximum 2 Mean, Median, Mode, Standard Deviation, Variance,