A Short Guide to R with RStudio



Similar documents
A Short Guide to Stata 13

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE

Descriptive Statistics

ECONOMICS 351* -- Stata 10 Tutorial 2. Stata 10 Tutorial 2

Data Analysis Tools. Tools for Summarizing Data

Getting Started with R and RStudio 1

Introduction to RStudio

Using R for Windows and Macintosh

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Introduction to Statistical Computing in Microsoft Excel By Hector D. Flores; and Dr. J.A. Dobelman

Directions for using SPSS

Introduction to SPSS 16.0

Financial Econometrics MFE MATLAB Introduction. Kevin Sheppard University of Oxford

R with Rcmdr: BASIC INSTRUCTIONS

IBM SPSS Statistics 20 Part 4: Chi-Square and ANOVA

Installing R and the psych package

Analysis of categorical data: Course quiz instructions for SPSS

Data Analysis. Using Excel. Jeffrey L. Rummel. BBA Seminar. Data in Excel. Excel Calculations of Descriptive Statistics. Single Variable Graphs

An introduction to IBM SPSS Statistics

Getting started manual

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:

Bill Burton Albert Einstein College of Medicine April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Using Excel for descriptive statistics

EXCEL Tutorial: How to use EXCEL for Graphs and Calculations.

An introduction to using Microsoft Excel for quantitative data analysis

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Importing Data into R

Describing, Exploring, and Comparing Data

Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions

Data exploration with Microsoft Excel: analysing more than one variable

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Data analysis and regression in Stata

Calibration and Linear Regression Analysis: A Self-Guided Tutorial

Microsoft Excel 2010 Part 3: Advanced Excel

Introduction Course in SPSS - Evening 1

Using Microsoft Excel to Plot and Analyze Kinetic Data

Introduction to IBM SPSS Statistics

SPSS Tests for Versions 9 to 13

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

January 26, 2009 The Faculty Center for Teaching and Learning

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

Using Excel for Statistics Tips and Warnings

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Microsoft Excel. Qi Wei

MULTIPLE REGRESSION EXAMPLE

Using SPSS, Chapter 2: Descriptive Statistics

Module 2 Basic Data Management, Graphs, and Log-Files

SPSS Resources. 1. See website (readings) for SPSS tutorial & Stats handout

Please follow these guidelines when preparing your answers:

Statistical Analysis Using SPSS for Windows Getting Started (Ver. 2014/11/6) The numbers of figures in the SPSS_screenshot.pptx are shown in red.

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

Using Excel for Statistical Analysis

Using stargazer to report regression output and descriptive statistics in R (for non-latex users) (v1.0 draft)

Figure 1. An embedded chart on a worksheet.

SAS Analyst for Windows Tutorial

5 Correlation and Data Exploration

SPSS Introduction. Yi Li

Getting your data into R

Quickstart for Desktop Version

An SPSS companion book. Basic Practice of Statistics

Data Visualization with R Language

Module 5: Statistical Analysis

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Gamma Distribution Fitting

IBM SPSS Missing Values 22

Agilent 85190A IC-CAP 2008

Step 2: Save the file as an Excel file for future editing, adding more data, changing data, to preserve any formulas you were using, etc.

Assignment objectives:

4 Other useful features on the course web page. 5 Accessing SAS

Summary of important mathematical operations and formulas (from first tutorial):

Instructions for SPSS 21

Homework 11. Part 1. Name: Score: / null

SPSS Manual for Introductory Applied Statistics: A Variable Approach

Below is a very brief tutorial on the basic capabilities of Excel. Refer to the Excel help files for more information.

Table of Contents. Preface

03 The full syllabus. 03 The full syllabus continued. For more information visit PAPER C03 FUNDAMENTALS OF BUSINESS MATHEMATICS

Data Analysis Stata (version 13)

GeoGebra Statistics and Probability

Time Series Analysis with R - Part I. Walter Zucchini, Oleg Nenadić

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

IBM SPSS Statistics 20 Part 1: Descriptive Statistics

IBM SPSS Missing Values 20

Zahner 08/2013. Monitor measurements in a separate window.

Introduction to SPSS (version 16) for Windows

SPSS Tutorial, Feb. 7, 2003 Prof. Scott Allard

SPSS 12 Data Analysis Basics Linda E. Lucek, Ed.D

Basics of STATA. 1 Data les. 2 Loading data into STATA

Prof. Pietro Ducange Students Tutor and Practical Classes Course of Business Intelligence

7 Time series analysis

Appendix 2.1 Tabular and Graphical Methods Using Excel

Clustering in the Linear Model

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!

Psych. Research 1 Guide to SPSS 11.0

Appendix III: SPSS Preliminary

TIPS FOR DOING STATISTICS IN EXCEL

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Introduction to StatsDirect, 11/05/2012 1

Transcription:

Short Guides to Microeconometrics Fall 2013 Prof. Dr. Kurt Schmidheiny Universität Basel A Short Guide to R with RStudio 1 Introduction 2 2 Installing R and RStudio 2 3 The RStudio Environment 2 4 Additions to R: Packages 2 5 Where to get help 3 6 Opening and Saving Data 4 7 Importing Data 4 8 Data Manipulation 6 9 Descriptive Statistics 7 10 Graphs 8 11 OLS Regression 9 12 Important Functions and Operators 11 Version: 22-9-2013, 21:56

A Short Guide to R with RStudio 2 1 Introduction This guide introduces the basic commands of the statistical software R using the graphical interface RStudio. Examples are shown using Windows, the adaption to other platforms such as Mac OS and Linux is straightforward. This guide only introduces the basic commands for data management and estimation. More commands (on panel data, limited dependent variables, monte carlo experiments, etc.) will be described in the respective handouts. R commands are set in Courier. 2 Installing R and RStudio R is a free open-source statistical software for various platforms such as Windows, Mac OS and Linux. R can be download from http://www.rproject.org/. RStudio is a convenient environment to run R. It is also a free, open-source, cross-platform software available at http://rstudio.org/. Install R first, and then RStudio. 3 The RStudio Environment When you start RStudio you will see a window with several panes: the Console pane where you type in your R commands and where results will be reported, the Workspace pane where open datasets and other objects are listed, the Plots pane where graphs are drawn, the History pane with a list of all your previously issued commands. More panes will be described below. 4 Additions to R: Packages Many researchers provide their own R programs through the R project webpage. Many packages are already preinstalled in the basic R instal-

3 Short Guides to Microeconometrics lation. They can be directly activated from RStudio: go to the Packages pane and tick the corresponding package. Alternatively, they are activated by issuing a command in the Console. For example, > library("foreign") activates additional commands to import data from foreing, i.e. other, software such as Stata and SAS. New packages can be installed by clicking on Install Packages in the Packages pane or with the command > install.packages("foreign") You will need to activate the new package before using it. 5 Where to get help The online help in R describes all basic R commands as well as commands in active packages. You can directly search the online help from the Help pane in RStudio. Alternatively, the help for a known command, e.g. load, is displayed in the Help pane by issuing the command or >?load > help("load") If you don t know the exact expression for the command, you can search the R documentation. For example, > help.search("read") lists in the top-left pane all packages and commands that contain the word read.

A Short Guide to R with RStudio 4 6 Opening and Saving Data R will look for data or save data in the drive and working directory. The working directory is specified depending on the operation system by > setwd("/users/sck/mypath") # example Mac OS > setwd("c:/my documents/mydir") # example Windows > setwd("c:\\my documents\\mydir") # alternative Windows Note: R cannot handle the usual windows path "c:\my documents\mydir". Open an existing R datafile (extension.rdata or.rda) or > load("example.rdata") > load("c:/my documents/mydir/example.rdata") The data is stored in an object called dataframe in the R workspace and appears in the Workspace pane. In the usual case where we are using variables from only one dataframe, it is convenient to apply the command > attach(mydata) which allows to access variables without specifying the dataframe. You can view the dataframe by double-clicking on its name, e.g. mydata, in the Workspace pane. Alternatively, you can list the first lines with the command > head(mydata) Save data in R format: > save(mydata, file="example.rdata") where mydata is the name of the dataframe to be saved. 7 Importing Data Data from common statistical packages like SAS, SPSS, STATA can be imported using the package foreign. For example, the following com-

5 Short Guides to Microeconometrics mand imports data from STATA: > library(foreign) > read.dta("c:/mypath/example.dta", convert.factors=false) where the option convert.factors=false asks R not to automatically convert all categorical variables into factor variables (see section 8). Comma-separated files can be imported with the following command: > mydata <- read.csv("c:/mypath/example.csv", sep=",") where example.csv is the name of the comma-separated file and mydata is the name of the dataframe to be created in the R workspace. You can set the comma separator (, or ; ) used in the datafile with the option sep. If you have already opened a dataframe with the same name in R, the old one will be automatically replaced. If the spreadsheet to be imported contains missing value codes, e.g. -999, rather than empty cells, the following commands converts these into the R code for missing values: > mydata <- read.csv("example.csv", na.strings="-999") Comma-separated files can be generated with e.g. Excel: Make sure that missing data values are coded as empty cells or as numeric values (e.g., 999 or -1). Do not use strings (e.g -, N/A ) to represent missing data. Make sure that there are only dots (. ) but no commas (, ) or semicolons ( ; ) in numbers. You may need to change this under Format menu, then select Cells.... Make sure that variable names are included only in the first row of your spreadsheet. Excel uses a different separator depending on the country settings: with English settings, a comma (, ) is typically used, with other languages a semicolon ( ; ) may be used.

A Short Guide to R with RStudio 6 Under the File menu, select Save As.... Then Save as type Comma Separated(.csv). The file will be saved with a.csv extension. 8 Data Manipulation All subsequent commands are shown using the example dataset mtcars. See?mtcars for a description of the data. The example data is loaded and stored in dataframe mtcars as > data(mtcars) A new variable in the dataframe mtcars is created by e.g. > mtcars$logmpg <- log(mtcars$mpg) where logmpg is the name of the new variable and log(mpg) is an example of a mathematical function of existing variables. If the dataframe is attached, the command is simply > mtcars$logmpg <- log(mpg) There are different ways of creating dummy variables in R: > mtcars$heavy = (mtcars$wt>=3) creates a logical variable which is TRUE if cars are heavier than 3,000 lb and FALSE otherwise. Logical variables are treated like a factor variable in estimation commands. Factor variables are categorical variables that can be either numeric or string variables. Categorical variables can only take a limited number of ordered or unordered values. A numeric dummy variable is created by or > mtcars$heavy = as.numeric(mtcars$wt>=3) > mtcars$heavy = ifelse(mtcars$wt>=3,1,0) in both cases the resulting variable is 1 for heavy cars and 0 otherwise. Numeric variables can be turned into factor variables by

7 Short Guides to Microeconometrics > mtcars$heavyf <- factor(mtcars$heavy) which also takes the two values 0 and 1. String labels can be attached to the two values by > mtcars$heavyf <- factor(mtcars$heavy, labels = c("light", "heavy")) Note: Variables are automatically replaced if a new assignment is made. Conditional assignments can be made by choosing a subset of observations with the bracket, [], notation. For example, > mtcars$hptype <- NA > mtcars$hptype[mtcars$hp<100] <- "economy" > mtcars$hptype[mtcars$hp>=100 & mtcars$hp < 200] <- "regular" > mtcars$hptype[mtcars$hp>=200] <- "sport" > mtcars$hptype = factor(mtcars$hptype) creates a factor variable with 3 types of horsepower. Variables are deleted from the dataframe by > mtcars$heavy <- NULL 9 Descriptive Statistics Display univariate summary statistics (mean, median,...) variables mpg in dataframe mtcars varlist: of numeric > summary(mtcars$mpg) Or if the dataframe mtcars is attached, simply > attach(mtcars) > summary(mpg) Statistics based on a subset of observations are given by e.g. > summary(subset(mpg, wt>=3)) or equivalently > summary(mpg[wt>=3]) where only observations with weight wt heavier than 3000 lb are used.

A Short Guide to R with RStudio 8 Report the frequency counts of variable am: > table(am, usena="ifany") where the option usena="ifany" requests to report missing values. Display the correlation and covariance matrix, respectively, between all numeric variables in the dataframe mtcars > cor(mtcars, use="pairwise.complete.obs") > cov(mtcars, use="pairwise.complete.obs") The option use="pairwise.complete.obs" asks to use all pairs of observations without missing values. The correlation between the two variables mpg and wt is computed by > cor(mpg, wt, use="complete") A basic two-way table of frequency counts of two variables gear and am is produced by: > table(gear, am) where the option usena = "ifany" allows to include missing values. The package gmodels adds a command to produce a detailed twoway table with counts, row and column percentages along with Pearson s chi-square statistic: > install.packages("gmodels") > library(gmodels) > CrossTable(gear, am, chisq=true) Perform a two-sample t-test of the hypothesis that mpg has the same mean for the two groups defined by the dummy variable am > t.test(mpg~am) where the two-samples are by default not assumed to have equal variances. 10 Graphs Draw a scatter plot of the variable mpg (y-axis) against wt (x-axis):

9 Short Guides to Microeconometrics > plot(wt, mpg) Add a red regression line after drawing the scatter plot > abline(lm(mpg~wt), col="red") Draw a scatter with connected points, title and labels > plot(wt, mpg, type="l", main="mileage and weight", xlab="weight", ylab="mile per gallon") Draw a histogram of the variable mpg > hist(mpg, freq=false) 11 OLS Regression The multiple linear regression model mpg i = β 0 + β 1 wt i + β 2 disp i + u i is estimated by OLS with the lm function. For example, > fm <- lm(mpg~wt+disp, data=mtcars) > summary(fm) regresses the mileage of a car (mpg) on weight (wt) and displacement (disp). A constant is automatically added if not suppressed by > lm(mpg~wt+disp-1, data=mtcars) Estimation based on a subsample is performed as > lm(mpg~wt+disp, data=mtcars, subset=(wt>3)) where only cars heavier than 3000 lb are considered. Tranformations of variables are directly included with the I() function > lm(i(log(mpg))~wt+i(wt^2)+disp, data=mtcars) The Eicker-Huber-White covariance is reported after estimation with > library(sandwich)

A Short Guide to R with RStudio 10 > library(lmtest) > coeftest(fm, vcov=sandwich) F -tests for one or more restrictions are calculated with the command waldtest which also uses the two libraries sandwich and lmtest > waldtest(fm, "wt", vcov=sandwich) tests H 0 : β 1 = 0 against H A : β 1 0 with Eicker-Huber-White, and > waldtest(fm,.~.-wt-disp, vcov=sandwich) tests H 0 : β 1 = 0 and β 2 = 0 against H A : β 1 0 or β 2 0. New variables with residuals and fitted values are generated by > mtcars$uhat <- resid(fm) > mtcars$mpghat <- fitted(fm) R treats factor variables different from numeric variables in estimation commands. See the handout on Functional Form in the linear Model for more details.

11 Short Guides to Microeconometrics 12 Important Functions and Operators Some Mathematical Expressions abs(x) returns the absolute value of x. exp(x) returns the exponential function of x. log(x) returns the natural logarithm of x if x>0. log10(x) returns the log base 10 of x if x>0. sqrt(x) returns the square root of x if x>=0. floor(x) largest integers not greater than x ceiling(x) smallest integers not less than x trunc(x) integer by truncating x toward 0 round(x, digits=n) rounds x to the nearest n decimal places Logical and Relational Operators > greater than < less than >= greater or equal <= smaller or equal == equal!= not equal & and or! not Some Probability distributions and density functions pnorm(q) dnorm(x) qnorm(p) cumulative standard normal distribution returns the standard normal density inverse cumulative standard normal distribution