Correlation and Model Fit

Correlation and Model Fit
Prof. Jacob M. Montgomery
Quantitative Political Methodology (L32 363)
November 9, 2016
Lecture 19 (QPM 2016): R, R-squared and F

Some class business
- Poster projects: poster files due on 12/7 at 10am (29 days from now).
- Problem set is long.
- This lecture is pretty abstract with a lot of equations. PLEASE ask questions.

Overview
Last time:
- Inference with regression
- A bit on interpreting regression output
This time:
- A quick reversion to correlation ($r$)
- Understanding the rest of the stuff on R output
- RMSE and model fit ($r^2$)
- F-tests

Alternative measure of correlation
Pearson's $r$ (standardized slope):
$$S_Y = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n - 1}}, \qquad S_X = \sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$$
$$r = \left(\frac{S_X}{S_Y}\right) \hat{\beta}$$
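The standardized-slope formula can be checked numerically. The following is a minimal Python sketch (the course itself uses R), and the five data points are made up for illustration, not taken from the lecture. It computes $r$ once as $(S_X / S_Y)\hat{\beta}$ and once from the usual covariance formula; the two should agree.

```python
import math

# Hypothetical toy data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample standard deviations (n - 1 in the denominator).
s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# OLS slope of Y on X.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta_hat = s_xy / sum((xi - x_bar) ** 2 for xi in x)

# Pearson's r as the standardized slope.
r_from_slope = (s_x / s_y) * beta_hat

# Pearson's r from the sample covariance, for comparison.
r_direct = (s_xy / (n - 1)) / (s_x * s_y)

print(round(r_from_slope, 4), round(r_direct, 4))
```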

Reminder: These are the main parameters
$$\hat{\beta} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$$
$$\hat{\alpha} = \bar{Y} - \hat{\beta} \bar{X}$$
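As a concrete illustration of these two estimators, here is a small Python sketch. The function name `ols_fit` and the toy data are hypothetical; in the course itself the same numbers would come from R's `lm`.

```python
def ols_fit(x, y):
    """Return (alpha_hat, beta_hat) for the simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: sum of cross-products over sum of squared X deviations.
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    # Intercept: the fitted line passes through (x_bar, y_bar).
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical toy data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

alpha_hat, beta_hat = ols_fit(x, y)
print(round(alpha_hat, 3), round(beta_hat, 3))
```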

How good is our model?: Thinking about variance
Unconditional variance: estimate of total variance in the population
$$S^2 = \hat{\sigma}^2_Y = \frac{\sum (Y_i - \bar{Y})^2}{n - 1}, \qquad S = \hat{\sigma}_Y = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n - 1}}$$
Sum of squared error: a measure of spread around the line
$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\alpha} - \hat{\beta} X_i)^2$$
Conditional variance: estimate of variance around the line in the population
$$\hat{\sigma}^2 = \frac{SSE}{n - 2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}, \qquad \hat{\sigma} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}}$$
$\hat{\sigma}^2$ is sometimes called mean squared error (MSE), and $\hat{\sigma}$ is called root mean squared error (RMSE), residual standard error (in R), or standard error of the estimate (in SPSS).

A bit of visualization
[Figure, three panels: "Histogram of Y" (frequency against Y); "Regression of X and Y" (Y against X); "Histogram of Y when X=2" (frequency against Y[X == 2]).]
A really, really good line will have small conditional variance.

How good is our model?: Thinking about variance
Hold onto some basic ideas:
- $\sum (Y_i - \bar{Y})^2$ = Total Sum of Squares
- Unconditional variance: $S^2 = \frac{\sum (Y_i - \bar{Y})^2}{n - 1}$
- $\sum (Y_i - \hat{Y}_i)^2$ = Sum of Squared Error
- Conditional variance: $\hat{\sigma}^2 = \frac{SSE}{n - 2} = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2}$
We are going to say that IF we have a really good model, $\hat{\sigma}^2$ should be a lot smaller than $S^2$.
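To make the comparison between unconditional and conditional variance concrete, here is a minimal Python sketch on made-up data (the data and variable names are assumptions for illustration only). It computes $S^2$, SSE, the MSE $\hat{\sigma}^2$, and the RMSE $\hat{\sigma}$ exactly as defined above.

```python
import math

# Hypothetical toy data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# OLS estimates for the line Y = alpha + beta * X.
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
alpha_hat = y_bar - beta_hat * x_bar

# Unconditional variance: spread of Y around its mean.
tss = sum((yi - y_bar) ** 2 for yi in y)
s2 = tss / (n - 1)

# Conditional variance: spread of Y around the fitted line.
sse = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
mse = sse / (n - 2)
rmse = math.sqrt(mse)

print(f"S^2 = {s2:.3f}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

With these toy numbers the MSE comes out smaller than $S^2$, which is the pattern a useful line should produce.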

Regression between presidential vote outcomes and election outcomes
[Figure: scatterplot of two-party presidential vote share (y-axis, 45 to 65) against Q2 GDP growth rate (x-axis, -10 to 10), with each point labeled by election year, 1948 through 2012.]

We draw the line that minimizes SSE
Residuals:
$$e_i = (Y_i - \hat{Y}_i) = (Y_i - \hat{\alpha} - \hat{\beta} X_i)$$
[Figure: "Residuals for presidential regression": incumbent party share of vote (y-axis) against 2nd quarter GDP growth (x-axis).]

How good is our model?: $r^2$
Define some preliminary terms:
$$TSS = \sum (Y_i - \bar{Y})^2, \qquad SSE = \sum (Y_i - \hat{Y}_i)^2$$
R-squared:
$$r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}} = \frac{\text{Total Variance} - \text{Unexplained Variance}}{\text{Total Variance}} = \frac{TSS - SSE}{TSS}$$

Some notes on $r^2$
R-squared: $r^2 = \frac{TSS - SSE}{TSS}$
- SSE = 0 (perfect fit) $\Rightarrow r^2 = 1$
- SSE = TSS (no fit) $\Rightarrow r^2 = 0$
- Does not depend on units of measurement.
- Sometimes called the coefficient of determination.
- It does not penalize for model complexity. Often adjusted R-squared is used.
- In R output this is labeled "Multiple R-squared".
Why do we use it?
- Gives us an overall impression of how well our model is doing.
- We can informally compare models.
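Two of these notes are easy to verify numerically: a perfect fit gives $r^2 = 1$, and $r^2$ does not change when the units of $X$ or $Y$ change. The Python sketch below uses a hypothetical helper `r_squared` and made-up data; the rescaling constants are arbitrary choices for illustration.

```python
def r_squared(x, y):
    """Return (TSS - SSE) / TSS for the simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
        (xi - x_bar) ** 2 for xi in x
    )
    alpha_hat = y_bar - beta_hat * x_bar
    tss = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))
    return (tss - sse) / tss


# Hypothetical toy data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2 = r_squared(x, y)

# Changing units (a linear rescaling of X and of Y) leaves r^2 unchanged.
x_rescaled = [32 + 1.8 * xi for xi in x]
y_rescaled = [100 * yi for yi in y]
r2_rescaled = r_squared(x_rescaled, y_rescaled)

# A perfectly linear outcome gives SSE = 0 and therefore r^2 = 1.
r2_perfect = r_squared(x, [3 + 2 * xi for xi in x])

print(round(r2, 4), round(r2_rescaled, 4), round(r2_perfect, 4))
```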

R output

    Call:
    lm(formula = vote ~ q2gdp, data = Abram)

    Residuals:
       Min     1Q Median     3Q    Max
    -6.002 -3.409  0.084  2.078  8.496

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  49.2560     1.4411  34.179 1.21e-15 ***
    q2gdp         0.7549     0.2578   2.928   0.0104 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 4.481 on 15 degrees of freedom
    Multiple R-squared:  0.3637,	Adjusted R-squared:  0.3213
    F-statistic: 8.573 on 1 and 15 DF,  p-value: 0.01039
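Several entries in this table can be recomputed from the others, which is a useful sanity check when reading regression output. The Python sketch below takes the printed values as inputs. Two things in it are assumptions rather than content from the slides: that n = 17 (inferred from 15 residual degrees of freedom plus 2 estimated parameters), and the standard adjusted R-squared formula $1 - (1 - r^2)\frac{n - 1}{n - p - 1}$, which the lecture mentions but does not write out.

```python
# Values read off the R output above.
estimate_slope, se_slope = 0.7549, 0.2578
estimate_intercept, se_intercept = 49.2560, 1.4411
r2 = 0.3637               # Multiple R-squared
residual_df = 15          # n - (p + 1)
p = 1                     # one covariate (q2gdp)
n = residual_df + p + 1   # assumption: 17 observations

# t value = estimate / standard error.
t_slope = estimate_slope / se_slope
t_intercept = estimate_intercept / se_intercept

# Adjusted R-squared (standard formula, not given on the slides).
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(t_slope, 3), round(t_intercept, 3), round(adj_r2, 4))
```

Up to rounding of the printed inputs, these should reproduce the "t value" column and the "Adjusted R-squared" entry.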

Example: Explaining congressional elections

A real quick primer on the F-statistic for Regression
Before, we said that $r^2$ is intuitively
$$r^2 = \frac{\text{Explained Variance}}{\text{Total Variance}}.$$
It makes sense then that $(1 - r^2)$ is the proportion of variance we haven't explained. It turns out that:
F-statistic for regression:
$$F = \frac{r^2 / p}{(1 - r^2) / [n - (p + 1)]}$$
Here $p$ is the number of covariates (gdp, incumbent, etc.), and $n$ is the number of observations. This will be distributed according to the F-distribution with $df_1 = p$ and $df_2 = n - (p + 1)$.
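Plugging the numbers from the presidential-vote R output into this formula is a quick way to see it at work. The Python sketch below uses a hypothetical helper `f_statistic`, and it assumes n = 17, inferred from the 15 residual degrees of freedom reported by R plus the 2 estimated parameters.

```python
def f_statistic(r2, n, p):
    """F = (r2 / p) / ((1 - r2) / (n - (p + 1)))."""
    return (r2 / p) / ((1 - r2) / (n - (p + 1)))


# Values read off the R output for lm(vote ~ q2gdp).
r2 = 0.3637   # Multiple R-squared
p = 1         # one covariate (q2gdp)
n = 17        # assumption: 15 residual df + 2 estimated parameters

f = f_statistic(r2, n, p)
print(round(f, 2))  # about 8.57; R reports 8.573 (r2 is rounded here)
```

With a single covariate, this F is also the square of the slope's t value (2.928 squared is about 8.573).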

Example F-Distributions
And now you understand (almost) everything on a regression table.

Hypothesis testing with F-statistics

Interpreting the F-test
Is our model any good?
- This is a formalized way of asking whether our model is any good.
- We compare the amount of variance explained by the regression to the amount unexplained.
- This is essentially a comparison of the following two models:
  $H_0: Y_i = \alpha + \epsilon_i$
  $H_a: Y_i = \alpha + X_i \beta + \epsilon_i$
- This is more useful in multivariate regression:
  $H_0: Y_i = \alpha + \epsilon_i$
  $H_a: Y_i = \alpha + X_{i1} \beta_1 + X_{i2} \beta_2 + X_{i3} \beta_3 + \dots + \epsilon_i$
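The model-comparison reading of the F-test can be sketched directly. Under $H_0$ the best prediction for every observation is $\bar{Y}$, so the null model's SSE equals TSS; under $H_a$ the SSE comes from the fitted line. The Python sketch below uses made-up data and the standard nested-model form $F = \frac{(SSE_0 - SSE_a)/p}{SSE_a/[n - (p + 1)]}$, which is not written on the slides but is algebraically the same as the $r^2$ version.

```python
# Hypothetical toy data, not from the lecture.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
p = 1  # one covariate

x_bar = sum(x) / n
y_bar = sum(y) / n

# Null model H0: Y_i = alpha + e_i. Fitted values are all y_bar, so SSE = TSS.
sse_null = sum((yi - y_bar) ** 2 for yi in y)

# Alternative model Ha: Y_i = alpha + beta * X_i + e_i, fitted by OLS.
beta_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
alpha_hat = y_bar - beta_hat * x_bar
sse_full = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(x, y))

# F as a comparison of the two models' unexplained variation.
f_nested = ((sse_null - sse_full) / p) / (sse_full / (n - (p + 1)))

# The same F from r^2, for comparison.
r2 = (sse_null - sse_full) / sse_null
f_from_r2 = (r2 / p) / ((1 - r2) / (n - (p + 1)))

print(round(f_nested, 3), round(f_from_r2, 3))
```

A large F says the covariates reduce the unexplained variation by much more than would be expected if $H_0$ were true.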

R output

    Call:
    lm(formula = vote ~ q2gdp, data = Abram)

    Residuals:
       Min     1Q Median     3Q    Max
    -6.002 -3.409  0.084  2.078  8.496

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  49.2560     1.4411  34.179 1.21e-15 ***
    q2gdp         0.7549     0.2578   2.928   0.0104 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 4.481 on 15 degrees of freedom
    Multiple R-squared:  0.3637,	Adjusted R-squared:  0.3213
    F-statistic: 8.573 on 1 and 15 DF,  p-value: 0.01039

Posters: General guidelines
- If you gather data, no sensitive questions.
- No time series data.
- You need to have a plan to deal with endogeneity.

The first thing you need to do...
1. Think of a research question! This step should always come first. You, as the researcher, need to formulate some sort of question that you want answered. You should care about this question and its answer! Not in an "I want a good grade" way, but in an "I really care about the answer to this question" kind of way. If you don't care about the question itself, then the project will be miserable to complete. Make it easier on yourself and research something you're actually interested in.

Once you have a question...
1. Once you have settled on a question, the next step is to come up with a hypothesis pertaining to that question.
2. The hypothesis needs to be testable with the tools that you have learned (or will learn) in this course. This generally precludes giant questions:
   - Why do Americans vote?
   - What makes people make environmentally friendly choices?
   - Does President Obama have an American birth certificate?
3. However, smaller questions are interesting too!
   - Do yard signs make people more likely to vote?
   - Can door knocking campaigns lead to higher levels of recycling?
   - Are racial attitudes related to beliefs about whether President Obama is not a citizen?

Once you have a question...
1. This hypothesis does not have to ultimately be supported. Most hypotheses are wrong, especially interesting ones.

Once you have a question...
1. Your question must be testable with data, meaning that you have to find a proper data source for your research project.
2. If you cannot find data that helps answer your question, then, for the purposes of this class, the question is unanswerable. If this applies to your research question, then it's time to find a new one.
3. Fortunately, the internet is full of data sources. If you can think of a data set you may need, there is a good chance that someone has collected it.

Finding Data
1. There are many data sources, and you can use any data source you want for your project, as long as the data provides evidence for an answer to your question. Do you wonder if your data is appropriate? Ask yourself if the data you have collected would be able to convince someone else of your answer. If there is any question as to whether or not this is the case, you should search for more data. While having data is better than not having data, there are differences in the quality of data across data sets. Some data is very credible, while other data should certainly be scrutinized.

Good Data Sources
1. For political science data, the best place to begin looking for data is http://projects.iq.harvard.edu/undergradscholars/book/datasets. This website is a new project by Harvard's Institute for Quantitative Social Sciences that is meant to direct undergraduate students on how to do a quantitative social science research project. The link above is to the data section, which contains a database of credible data sources. If you want to do something in the political science realm, this is the place to begin looking for the data set that best suits your question.

Specific Data Sources
1. For many questions regarding American political attitudes and behavior, the American National Election Study (ANES) is the best survey to use.
   - Survey is done every two years before and after an election.
   - Asks many sociological and political questions.
   - Attempts to capture behavior through a large set of questions.
   - The survey has a relatively large sample size. However, not all respondents are asked the same battery of questions.

Specific Data Sources
1. For questions of a more sociological nature, and questions that the ANES just doesn't seem to work well for, the General Social Survey (GSS) may suit your needs better.
   - The GSS is done every one or two years and has a very large sample of respondents.
   - Asks many sociological questions and questions about attitudes towards issues.
   - Only samples respondents from the United States.

Other Data Sources
1. For questions that could be categorized in the realm of international politics, you can use the Correlates of War surveys and its various offshoots. Another good source is the V-Dem project (https://www.v-dem.net/en/).
2. In general, any data set that you find a link for on the IQSS website is going to be credible and rather informative. Start early and spend a good amount of time finding the right data set.
3. Many scholars post their datasets on their websites. Many journals archive related datasets on their websites. If you find a good paper written in the last 5 years, it might just be online waiting for you.

Other Data Sources
1. Does your question require an obscure data source that cannot be found on the IQSS website? Then searching the ICPSR data set archives might be the best option for you: http://www.icpsr.umich.edu/icpsrweb/icpsr/index.jsp
   - ICPSR is an initiative based at the University of Michigan that houses a huge database of political science data sets.
   - Data sets contained in this archive range from Iowa Census data from 1908 to the results of every U.S. House election from the 1940s to the present.
   - This database is less user friendly than the IQSS student one, but if you can't find what you need on that website, then this may be your next best option.

Other Data Sources
1. The IQSS also has another, albeit less user friendly, database of data sources: http://dvn.iq.harvard.edu/dvn/
   - Like the ICPSR website, there is a very large set of data sets here.
   - However, it is less user friendly than the IQSS student website. Finding the right data may take a long time on this website.

Non-Political Science Data
1. The question for your project does not have to be political science related! You can choose to do a project on any question that you are interested in answering.
   - If you're interested in sports, then the following website is a database of sports related data sources: http://www.amstat.org/sections/sis
   - Another database of data sets can be found at http://rss.acs.unt.edu/rdoc/library/ecdat/html/00index.html. These data sets, which are actually part of an R package, vary in subject from tobacco budgets to crime in North Carolina to heating and cooling system choices.
   - Before you use any of these data sets, though, make sure that there is a valid codebook which contains information about the survey method, sampling procedures, etc.

Other Options
1. Are none of these data sources good enough for you? Then maybe you should collect your own data. Remember, we discussed sampling schemes at the very beginning of the course, so we expect you to tell us why your sampling scheme is good if you choose to collect your own data. Keep in mind that you need a decent sample size. If you can pull it off, though, you can often get much better data to answer your question of interest. If you choose to collect your own data, start early and talk to your group, the TAs, etc. The more opinions about your project you get, the better it will be.

Rules
1. Above all else, if you choose to collect your own data it has to be appropriate! In research, we have a board that approves human research projects called the IRB. If you think your data is questionable by appropriateness standards, read the IRB website to see if it would meet their standards. In short, this eliminates surveys involving minors, the homeless population, prisoners, the mentally handicapped, and any other population that is in a position that might cause them to feel obligated to take part in your survey. In addition to these requirements, your data cannot involve questions about alcohol or drug usage. Use your common sense!

How to get a good grade.
1. Once you have your data and have done all of the statistical analysis that you can possibly do, you and your team are tasked to make a poster telling everybody about the awesome work you have done. Your findings may seem minor to you, but they are probably much more interesting than you give yourself credit for. The data is new to everyone else, so they will probably find it very interesting.
2. The best way to make sure that your poster gets a good grade from the QPM team is to follow the instructions on the last page of your syllabus! The rubric for grading your poster is very descriptive and thorough. The examples on Blackboard all scored an A based on this rubric.

Some notes about the posters
1. We need documented R code for everything you do to make your poster! And this R code should be full of comments! Lots and lots of comments! Lots of comments eliminate guesswork if we aren't sure what you are doing in your R code. Less guesswork almost certainly leads to a better grade.
2. Spend a good amount of time on the statistical analysis as well as the poster presentation. We like well thought out projects as well as well presented projects. Both of these contribute to your overall grade.
3. Finally, don't wait until the last minute for this project. This project is a pretty significant portion of your grade and should be treated as such. Waiting until the last minute to do this project will surely lead to a poor grade.