Stata Walkthrough 4: Regression, Prediction, and Forecasting
- Colin Matthews
Over drinks the other evening, my neighbor told me about his 25-year-old nephew, who is dating a 35-year-old woman. "God, I can't see them getting married," he said. I raised my eyebrow, because as an economic demographer, I know that spouses' ages are very predictable: they tend to be similar, with the husband just a couple of years older than his wife. Strange cases do occur, but they tend to involve older men and younger women. The opposite is fairly unlikely. However, I didn't know exactly what would constitute a range of likely outcomes for the age of the woman that a 25-year-old guy would marry.

In this Stata exercise, we're going to do two things:

1. We will use data on married couples' ages to estimate the relationship between spouses' ages.
2. We will use this estimate to predict a range of likely outcomes for the age of the wife of a twenty-five-year-old man.

You'll need to load the database of U.S. married couples, March 2005 (marriedmar05.dta) from my webpage. These data come from the Current Population Survey (conducted by the Bureau of Labor Statistics), and they are a random sample of all households in the U.S. I have restricted the sample to only couples that identify themselves as married. We have 34,674 of these couples.

In general, you should begin any project by exploring the data. In this case, that's simple, since this database contains only two variables: hage and wage, the husband's age and the wife's age. First, let's look at some descriptive statistics: type summarize.

        Variable |     Obs        Mean    Std. Dev.       Min        Max
    -------------+------------------------------------------------------
            wage |   34674        45.8                     15         79
            hage |   34674        48.1                     15         79

Values of each variable range from 15 to 79. The mean age of a married man is 48.1 years, and the mean age of a married woman is 45.8 years. We can see that there's a strong correlation between the two variables by typing corr hage wage:
                 |     wage     hage
        ---------+------------------
            wage |   1.0000
            hage |   0.9243   1.0000

In other words, older men tend to be married to older women; this should be no surprise. However, the sample means reveal that married men are older than married women on average, so we can also infer that typically each husband must be older than his wife. We might be interested in the distribution of the age difference, so let's create a new variable for the difference in ages:

gen dage = hage - wage
label variable dage "Difference between spouses' ages"

Now let's look at the frequency distribution of this variable by typing

histogram dage

The image is a bit ugly. First, the graph goes from -50 to +50, even though there's basically nobody out in those tails. (There are some people, but they're not numerous enough to register in the graph.) For all practical purposes, the range goes from -20 to +30, so we'll restrict it to that. Second, Stata is using a default bin width of 2.44 units. There's a natural width in this case, I'd think: one, a single year. Let's tell Stata to make each bin of width one. Third, there aren't nearly enough tick marks or labels on the axes to give us a clear indication of the values, so we'll add those. Finally, I think it's easier to interpret percentages than frequencies, so we'll specify that option. Here's the complete command:

histogram dage if dage >= -20 & dage <= 30, width(1) ylabel( ) ytick(1(1)15) xlabel(-20(10)30) xtick(-20(2)30) percent

There's still one small problem: the bins start just to the right of the corresponding number. We can correct this by specifying start(-20.5) in our list of options; that way, Stata will place the bin running from -20.5 to -19.5 (which includes the value -20) right over the number -20 in the graph, and so forth. (To do this correction, click on the last command that you typed in the Review window; it should appear in your
command line, and you can tack on the extra option.) The final output should look like this:

[Histogram of dage, in percent, labeled "Difference in spouses' age (husband's minus wife's)"]

You can see the asymmetry in the distribution: the frequency of -1 (men who are one year younger than their wives) is roughly the same as the frequency of +3 (men who are three years older than their wives). The frequency of -2 is about the same as the frequency of +6. Finally, only a small number (0.33%) of men are ten years younger than their wives (the match that my neighbor worries about). I knew that it was unlikely.

This isn't a perfect answer to the question, however: it tells us the chance for the overall population. We have information about the specific age of this person (he is twenty-five), and we should use that knowledge to improve our predictions. The relationship might not be the same for all ages. For example, it could be the case that men typically marry women almost exactly their own age initially; then, when they get to a midlife crisis (or a seven-year itch, whichever comes first), they trade in their spouse for a younger model. That could generate something like the distribution that we observe; however, it would be inaccurate to claim that the distribution of wives' ages is the same for men at all ages. After all, our data says that
7.0% of men are ten years older than their wives, but surely this isn't true for seventeen-year-old grooms?

We want to predict the wife's age as a function of the husband's age. We will assume that the relationship can be described as:

wage_i = β0 + β1 · hage_i + e_i

One note: when we're doing prediction, we don't necessarily have to believe that the husband's age causes the wife's age (even though we model it that way). From the descriptive statistics, we have enough information to calculate our best estimates β̂1 and β̂0:

β̂1 = r_xy · (s_y / s_x) = (0.9243) · (s_y / s_x) = 0.8998
β̂0 = Ȳ - β̂1 · X̄ = 45.8 - (0.8998)(48.1) ≈ 2.5

However, it's generally easier to let Stata do the work. Type reg wage hage, and you should get the regression output:

          Source |       SS       df       MS              Number of obs =   34674
    -------------+------------------------------           F(  1, 34672) =
           Model |                                         Prob > F      =
        Residual |                                         R-squared     =  0.8543
    -------------+------------------------------           Adj R-squared =
           Total |                                         Root MSE      =
    ------------------------------------------------------------------------------
            wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            hage |     0.8998
           _cons |

Let's look first at the bottom portion of this table. The estimated coefficients (β̂1 and β̂0) are exactly what we found using the formulas. In addition, Stata calculates the standard error of each estimate (s_β̂1 and s_β̂0, in the notation from class). At the right-hand side, it gives us a 95% confidence interval for each parameter. We can very quickly test a (two-tailed) hypothesis using these. In the top-right of the
table, we have the R² value of the regression: 0.8543. This means that husbands' ages can predict 85.43% of the variation in wives' ages. (Remember that the R² is also equal to the model sum of squares divided by the total sum of squares. We can also calculate the R² value as simply the correlation coefficient, squared: (0.9243)² ≈ 0.8543.) Finally, in the top-left, we have the model (explained), residual, and total sums of squares.

Next, we want to predict the expected age of each person's spouse. There are three things that we might want to know: the best single guess of the wife's average or actual age (these are the same), a range of estimates for the average age, or a range of estimates for the actual age.

We can obtain a point estimate of the wife's age for each person by using the formula:

wagehat_i = β̂0 + β̂1 · hage_i

To do this in Stata, we could type gen wagehat = b0 + b1*hage, typing in the two estimated coefficients by hand (here b0 and b1 stand for the numbers in the regression output). This is a bit time-consuming and error-prone, so Stata has a pre-programmed shortcut for many post-estimation predictions. You can read help regress postestimation for a complete description, but let me give you a partial list:

    Command                   What it does
    predict varname1, xb      Calculates ŷ_i for each observation, and saves it to a new variable called varname1
    predict varname2, r       Calculates the residual ê_i = y_i - ŷ_i for each observation, and saves it to a new variable called varname2
    predict varname3, stdp    Calculates the standard deviation of the prediction, s_ŷ, for each observation, and saves it to a new variable called varname3
    predict varname4, stdf    Calculates the standard deviation of the forecast, s_(y-ŷ), for each observation, and saves it to a new variable called
varname4.

The names of the variables can be whatever you like, but the options (xb, r, stdp, and stdf) need to be stated just like this. In this case, we would type:

predict wagehat2, xb

Then Stata will create a new variable called wagehat2 with the predicted value of each wife's age. By typing corr wagehat wagehat2, you can check that these values are the same as the ones obtained by using the formula wagehat_i = β̂0 + β̂1 · hage_i. A correlation of 1.0000 means that the variables are exactly the same.

With this variable, we will graph the predicted age of wives. Let's type:

graph twoway (scatter wage hage) (line wagehat2 hage)

If you want to know the predicted value for a man of a specific age (like my neighbor's nephew), you can type:

summ wagehat2 if hage==25

This tells us the average predicted age of the wife of a 25-year-old man. Even though men are typically married to women who are 2.42 years younger, 25-year-old men are not: they are usually married to women of their same age (more or less). We also see that the standard deviation of the variable is zero. We can infer from this that the variable's value is the same for all people in this sample (that is, for all men aged twenty-five).
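To make the mechanics concrete, here is a sketch of the estimation and prediction steps in Python rather than Stata, using a small made-up dataset (all numbers are hypothetical stand-ins for the CPS couples file). It checks the identity β̂1 = r · (s_y / s_x) against the least-squares slope, and builds the fitted values that predict wagehat2, xb would produce.

```python
import math

# Hypothetical toy data standing in for the CPS couples file.
hage = [22, 30, 41, 55, 63, 70]   # husbands' ages
wage = [21, 27, 40, 50, 62, 66]   # wives' ages

n = len(hage)
xbar = sum(hage) / n
ybar = sum(wage) / n
sxx = sum((x - xbar) ** 2 for x in hage)
syy = sum((y - ybar) ** 2 for y in wage)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(hage, wage))

b1 = sxy / sxx            # least-squares slope
b0 = ybar - b1 * xbar     # least-squares intercept

# The slope equals the correlation times the ratio of standard deviations.
r = sxy / math.sqrt(sxx * syy)
assert abs(b1 - r * math.sqrt(syy / sxx)) < 1e-12

# What `predict wagehat2, xb` computes: the fitted value for each row.
wagehat = [b0 + b1 * x for x in hage]
```

Because the regression includes an intercept, the fitted values always average out to the sample mean of the outcome, which is why the single best guess and the average prediction coincide.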
However, we might worry that our prediction is inaccurate, so we might want to form a confidence interval for this estimated average age. We know that the formula for this confidence interval is:

Ŷ_i ± z_(α/2) · s_ŷ

where s_ŷ is the standard deviation of the prediction (given by a formula in the book). We can ask Stata to calculate the standard error of the prediction for us:

predict syhat, stdp

The stdp option tells Stata that we want the standard deviation of the prediction; it creates a variable named syhat with this value (again, this name can be whatever you choose). A 95% confidence interval would be:

gen lowerpred = wagehat - 1.96*syhat
gen upperpred = wagehat + 1.96*syhat

We can see these ranges if we create the graph:

graph twoway (scatter wage hage) (line wagehat2 hage) (line lowerpred hage) (line upperpred hage)

You might notice that the lower-bound and upper-bound lines almost exactly coincide with the prediction line. This is because we have estimated the model very precisely: we have a lot of data, we have a lot of variation in hage, and hage is a good predictor of wage. As a consequence, we are fairly confident that we have estimated the prediction line accurately. (In practice, you will almost never find such close bounds for your predictions, but spouse's age is one of the most predictable variables there is.)

We can also obtain the upper and lower bounds of the 95% confidence interval for the expected age of the wife at any specific value of hage. To do this, we would type:

summ lowerpred upperpred if hage==25

The output looks like this:
        Variable |     Obs        Mean    Std. Dev.       Min        Max
    -------------+------------------------------------------------------
       lowerpred |
       upperpred |

Because the standard deviation is zero, this is simply a constant value for everyone in this group. For all men aged 25, the means of lowerpred and upperpred give the bounds of the 95% confidence interval for the average wife's age. (As a note, we would not be able to reject the hypothesis that a 25-year-old man marries a woman of the same age, 25, on average.)

Finally, we want to come up with a range of likely outcomes for the actual value: our forecast interval. We know the formula for this:

Ŷ_i ± z_(α/2) · s_(y-ŷ)

where s_(y-ŷ) is the standard deviation of the forecast. We have a formula for this, but Stata can also generate it for us:

predict syminusyhat, stdf

Now it is simple to construct a 95% confidence interval for the actual age of the spouse:

gen lowerfore = wagehat - 1.96*syminusyhat
gen upperfore = wagehat + 1.96*syminusyhat

Let's generate a graph that shows these bounds for all ages:

graph twoway (scatter wage hage) (line wagehat2 hage) (line lowerfore hage) (line upperfore hage)

There are a few things that bother me about this graph. Given how many observations we have, the dots are so big that they become a meaningless jumble. We can specify the option msize(.2) to shrink the dots to 20% of their default size. Also, it annoys me a bit that the forecast lines extend into implausible ranges (they suggest that a 16-year-old guy has a 7-year-old wife!). We don't observe this in the data, so I'm going to cut off that portion of the line, and similarly where it goes above the values we observe in the data, at the top right. Finally, we should place a title on the vertical axis.
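The two standard errors behind stdp and stdf differ by a single term. As a hedged sketch (the textbook simple-regression formulas, applied to made-up data in Python, not the walkthrough's output): the standard error of the average prediction at x0 is s·sqrt(1/n + (x0-x̄)²/Sxx), while the forecast standard error adds the residual variance itself, s·sqrt(1 + 1/n + (x0-x̄)²/Sxx), which is why forecast intervals are always wider than confidence intervals for the mean.

```python
import math

# Toy data (hypothetical, for illustration only).
x = [20, 25, 30, 40, 50, 60, 70]
y = [22, 24, 31, 38, 47, 59, 66]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Root MSE: residual standard deviation with n - 2 degrees of freedom.
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 25
yhat0 = b0 + b1 * x0
se_pred = s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)        # like stdp
se_fore = s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)    # like stdf

# 95% intervals: for the mean of y at x0, and for an individual y at x0.
ci  = (yhat0 - 1.96 * se_pred, yhat0 + 1.96 * se_pred)
fci = (yhat0 - 1.96 * se_fore, yhat0 + 1.96 * se_fore)
```

With a huge sample like ours, the 1/n and (x0-x̄)²/Sxx terms shrink toward zero, so the confidence bands hug the line while the forecast bands stay wide: exactly the pattern in the two graphs above.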
graph twoway (scatter wage hage, msize(.2) ytitle("Wife's Age")) (line wagehat2 hage) (line lowerfore hage if lowerfore>15) (line upperfore hage if upperfore<80)

This looks a bit better, but the data is still too dense to give us a clear idea of how many people are in the center of the distribution. I'm going to add one more option, jitter(2), which will place each point at a random location near its actual value.

graph twoway (scatter wage hage, msize(.2) jitter(2)) (line wagehat2 hage) (line lowerfore hage if lowerfore>15) (line upperfore hage if upperfore<80)

If you've done this right, you should have an image that looks like a photograph of a galaxy, with a very high concentration of dots in a narrow band. Finally, let's fix the labels on the variables:

label variable wagehat2 "Expected wife's age"
label variable upperfore "Upper bound on forecast"
label variable lowerfore "Lower bound on forecast"
label variable hage "Husband's age"

Now generate the graph again. We should get:

[Graph: wife's actual age against husband's age, with the expected wife's age line and the lower and upper bounds on the forecast.]
The final question we have is: what is a likely range of ages for the wife of a 25-year-old (and does it include 35)? We would type:

summ upperfore lowerfore if hage==25

This gives us the following values:

        Variable |     Obs        Mean    Std. Dev.       Min        Max
    -------------+------------------------------------------------------
       upperfore |
       lowerfore |

In other words, the likely range runs from the mean of lowerfore up to the mean of upperfore. The value of 35 is just outside this 95% confidence interval: it could happen, but only about 5% of men would be in such a relationship. My initial intuition was correct: this would be an unlikely match.

That wraps up this little exercise. I'd like to add a few final notes about whether this analysis was entirely valid. The ordinary least squares regression model makes several assumptions. The first is that we have some variation in the explanatory variable: the husband's age, in this case. Clearly, that is true; we would not be able to compute β̂1 otherwise. The second assumption is that the unobservables e_i are uncorrelated with the explanatory variable x_i. In this case, you can think of the unobservable as marrying someone younger or older than we would predict. We are assuming that this does not tend to vary with the husband's age. That can't really be true for younger men: if someone is seventeen years old, then we predict that he should marry a woman of roughly his own age. He could marry someone older (even substantially older) than this with no problem, but he cannot easily marry someone younger (especially substantially younger) because of legal restrictions. This means that young men will tend to marry only women who are older than we predict. I might argue that the same is true at the opposite end of the spectrum: we could imagine a 60-year-old man married to someone thirty years younger than we would expect, but it would be hard to imagine that he would be equally likely to be married to someone thirty years older than we would predict. Because of these tendencies, one of our OLS assumptions is violated. Unfortunately, this will happen often when we're doing research.
In these cases, we have to ask ourselves: how bad is the violation?
I'd argue that it's probably slight, in this case. These violations apply to the extremes of the distribution; they don't apply to the majority of our observations. I wouldn't worry about it. (I'll also add: when I used more sophisticated econometric tools to correct for this problem, I found virtually the same estimates of β̂0 and β̂1.)

Finally, there is the assumption that the variance of the unobservables is the same for all observations (and that these unobservables are distributed normally). In other words, that there is the same variation in the ages of young men's wives as there is in middle-aged and older men's wives. I know that this isn't true: there's much more variation for middle-aged men than there is for anyone else.

Also, what about the assumption that the unobservables are distributed normally? The histogram of dage revealed an asymmetry in the distribution. Let's look at the residuals that we obtained in this case:

predict resid, r
histogram resid if resid>-10 & resid<10, width(1) normal start(-10.5)

[Histogram of the residuals, on a density scale, with a normal curve overlaid.]
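As a quick aside (a sketch in Python with made-up numbers, not part of the walkthrough): whatever the shape of the unobservables, OLS residuals from a regression with an intercept always average to exactly zero, so any asymmetry has to show up in the shape of the histogram, for instance in a gap between the median residual and the mean.

```python
import statistics

# Toy data with an uneven error pattern (hypothetical, for illustration).
x = [20, 30, 40, 50, 60, 70]
y = [19, 33, 39, 54, 59, 74]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# What `predict resid, r` computes.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

mean_resid = sum(resid) / n               # zero up to rounding error
median_resid = statistics.median(resid)   # need not be zero if skewed
```

This is why we judge normality from the histogram rather than from the average residual: the average is zero by construction.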
The blue curve represents a normal distribution with the same mean and variance as resid. We see that this doesn't quite match up: if a wife is older than we predict, she is usually older by only one or two years. These categories are overrepresented in our histogram (relative to the normal distribution). In such a large sample, there's almost no room for leeway: if the histogram doesn't match up perfectly with a normal curve, the variable is almost certainly not distributed normally.

What effect does this violation have? It means that our calculations of the confidence intervals are wrong. Remember that 95% of observations should be within our confidence intervals; 2.5% should be above, and 2.5% should be below. Let's see how many observations are actually above or below our forecasts. First we'll create dummy variables for "wife's actual age exceeds the upper bound" and "wife's actual age is below the lower bound":

gen above = (wage > upperfore)
gen below = (wage < lowerfore)
summ above below

You can interpret the means of the dummy variables as the fraction of the sample that falls into that category (this is always the interpretation of the mean of a dummy variable). We have 2.26% of observations above the upper bound, and 3.25% below the lower bound. Altogether, our 95% confidence interval contains 94.48% of observations. This is a bit off (again, in a sample this size, we should get almost exactly 95% if the normality assumption is correct), but it's not bad. The more troubling feature is the asymmetry: we should have 2.5% above, but we have slightly less; and we should have 2.5% below, but we have substantially more. This happens because the distribution (of unobservables) is not symmetric. If we wanted to do this perfectly, we should model the unobservables with some other probability distribution, one that is not symmetric. However, that's best left for a later day.
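The coverage check above can be sketched in Python (a toy illustration with made-up ages and bounds, not the CPS numbers): the dummies are 0/1 indicators, and their means are the fractions of the sample outside each bound.

```python
# Hypothetical wives' ages and forecast bounds, for illustration only.
wage      = [24, 30, 41, 26, 55]
lowerfore = [20, 28, 43, 25, 50]
upperfore = [30, 38, 53, 35, 60]

# Mirrors: gen above = (wage > upperfore); gen below = (wage < lowerfore)
above = [int(w > u) for w, u in zip(wage, upperfore)]
below = [int(w < l) for w, l in zip(wage, lowerfore)]

# The mean of a 0/1 dummy is the fraction of the sample in that category.
frac_above = sum(above) / len(above)   # → 0.0 in this toy sample
frac_below = sum(below) / len(below)   # → 0.2 (one of five below its bound)
coverage = 1 - frac_above - frac_below
```

With well-calibrated 95% intervals and a large sample, frac_above and frac_below should each come out near 0.025; the asymmetry we found (2.26% vs. 3.25%) is the fingerprint of the skewed unobservables.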
Quick Stata Guide by Liz Foster
by Liz Foster Table of Contents Part 1: 1 describe 1 generate 1 regress 3 scatter 4 sort 5 summarize 5 table 6 tabulate 8 test 10 ttest 11 Part 2: Prefixes and Notes 14 by var: 14 capture 14 use of the
COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES.
277 CHAPTER VI COMPARISONS OF CUSTOMER LOYALTY: PUBLIC & PRIVATE INSURANCE COMPANIES. This chapter contains a full discussion of customer loyalty comparisons between private and public insurance companies
International Statistical Institute, 56th Session, 2007: Phil Everson
Teaching Regression using American Football Scores Everson, Phil Swarthmore College Department of Mathematics and Statistics 5 College Avenue Swarthmore, PA198, USA E-mail: [email protected] 1. Introduction
T O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these
MBA 611 STATISTICS AND QUANTITATIVE METHODS
MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain
Outline. Topic 4 - Analysis of Variance Approach to Regression. Partitioning Sums of Squares. Total Sum of Squares. Partitioning sums of squares
Topic 4 - Analysis of Variance Approach to Regression Outline Partitioning sums of squares Degrees of freedom Expected mean squares General linear test - Fall 2013 R 2 and the coefficient of correlation
Section 1: Simple Linear Regression
Section 1: Simple Linear Regression Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction
Lecture Notes Module 1
Lecture Notes Module 1 Study Populations A study population is a clearly defined collection of people, animals, plants, or objects. In psychological research, a study population usually consists of a specific
CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression
Opening Example CHAPTER 13 SIMPLE LINEAR REGREION SIMPLE LINEAR REGREION! Simple Regression! Linear Regression Simple Regression Definition A regression model is a mathematical equation that descries the
Multiple Linear Regression in Data Mining
Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple
SPSS Guide: Regression Analysis
SPSS Guide: Regression Analysis I put this together to give you a step-by-step guide for replicating what we did in the computer lab. It should help you run the tests we covered. The best way to get familiar
Handling missing data in Stata a whirlwind tour
Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled
Section 14 Simple Linear Regression: Introduction to Least Squares Regression
Slide 1 Section 14 Simple Linear Regression: Introduction to Least Squares Regression There are several different measures of statistical association used for understanding the quantitative relationship
Two-sample hypothesis testing, II 9.07 3/16/2004
Two-sample hypothesis testing, II 9.07 3/16/004 Small sample tests for the difference between two independent means For two-sample tests of the difference in mean, things get a little confusing, here,
Multinomial and Ordinal Logistic Regression
Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,
12: Analysis of Variance. Introduction
1: Analysis of Variance Introduction EDA Hypothesis Test Introduction In Chapter 8 and again in Chapter 11 we compared means from two independent groups. In this chapter we extend the procedure to consider
Chapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -
Introduction to Quantitative Methods
Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................
5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
DATA INTERPRETATION AND STATISTICS
PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE
CALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents
The Assumption(s) of Normality
The Assumption(s) of Normality Copyright 2000, 2011, J. Toby Mordkoff This is very complicated, so I ll provide two versions. At a minimum, you should know the short one. It would be great if you knew
DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.
DESCRIPTIVE STATISTICS The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses. DESCRIPTIVE VS. INFERENTIAL STATISTICS Descriptive To organize,
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools
Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools Occam s razor.......................................................... 2 A look at data I.........................................................
Using Excel for Statistical Analysis
Using Excel for Statistical Analysis You don t have to have a fancy pants statistics package to do many statistical functions. Excel can perform several statistical tests and analyses. First, make sure
Chi Square Tests. Chapter 10. 10.1 Introduction
Contents 10 Chi Square Tests 703 10.1 Introduction............................ 703 10.2 The Chi Square Distribution.................. 704 10.3 Goodness of Fit Test....................... 709 10.4 Chi Square
From the help desk: Swamy s random-coefficients model
The Stata Journal (2003) 3, Number 3, pp. 302 308 From the help desk: Swamy s random-coefficients model Brian P. Poi Stata Corporation Abstract. This article discusses the Swamy (1970) random-coefficients
Using Stata 9 & Higher for OLS Regression Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 8, 2015
Using Stata 9 & Higher for OLS Regression Richard Williams, University of Notre Dame, http://www3.nd.edu/~rwilliam/ Last revised January 8, 2015 Introduction. This handout shows you how Stata can be used
Simple Linear Regression
STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze
Introduction to Hypothesis Testing
I. Terms, Concepts. Introduction to Hypothesis Testing A. In general, we do not know the true value of population parameters - they must be estimated. However, we do have hypotheses about what the true
Point Biserial Correlation Tests
Chapter 807 Point Biserial Correlation Tests Introduction The point biserial correlation coefficient (ρ in this chapter) is the product-moment correlation calculated between a continuous random variable
Main Effects and Interactions
Main Effects & Interactions page 1 Main Effects and Interactions So far, we ve talked about studies in which there is just one independent variable, such as violence of television program. You might randomly
Normality Testing in Excel
Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. [email protected]
Chapter 13 Introduction to Linear Regression and Correlation Analysis
Chapter 3 Student Lecture Notes 3- Chapter 3 Introduction to Linear Regression and Correlation Analsis Fall 2006 Fundamentals of Business Statistics Chapter Goals To understand the methods for displaing
The average hotel manager recognizes the criticality of forecasting. However, most
Introduction The average hotel manager recognizes the criticality of forecasting. However, most managers are either frustrated by complex models researchers constructed or appalled by the amount of time
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
HYPOTHESIS TESTING WITH SPSS:
HYPOTHESIS TESTING WITH SPSS: A NON-STATISTICIAN S GUIDE & TUTORIAL by Dr. Jim Mirabella SPSS 14.0 screenshots reprinted with permission from SPSS Inc. Published June 2006 Copyright Dr. Jim Mirabella CHAPTER
Regression step-by-step using Microsoft Excel
Step 1: Regression step-by-step using Microsoft Excel Notes prepared by Pamela Peterson Drake, James Madison University Type the data into the spreadsheet The example used throughout this How to is a regression
Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear.
Multiple Regression in SPSS This example shows you how to perform multiple regression. The basic command is regression : linear. In the main dialog box, input the dependent variable and several predictors.
Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures
Testing Group Differences using T-tests, ANOVA, and Nonparametric Measures Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone:
Econometrics I: Econometric Methods
Econometrics I: Econometric Methods Jürgen Meinecke Research School of Economics, Australian National University 24 May, 2016 Housekeeping Assignment 2 is now history The ps tute this week will go through
An analysis method for a quantitative outcome and two categorical explanatory variables.
Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that
Business Valuation Review
Business Valuation Review Regression Analysis in Valuation Engagements By: George B. Hawkins, ASA, CFA Introduction Business valuation is as much as art as it is science. Sage advice, however, quantitative
