Merge/Append using R (draft)

Similar documents
Exploring Data and Descriptive Statistics (using R)

POLS 7014: Intermediate Political Methodology. Spring 2016

How To Run Factor Analysis

Getting Started in Fixed/Random Effects Models using R

Multinomial Logistic Regression

Data Preparation & Descriptive Statistics (ver. 2.7)

PROBABILITY AND STATISTICS. Ma To teach a knowledge of combinatorial reasoning.

Financial Econometrics

240ST014 - Data Analysis of Transport and Logistics

INTERNATIONAL UNIVERSITY OF JAPAN Public Management and Policy Analysis Program Graduate School of International Relations

Calculating, Interpreting, and Reporting Estimates of Effect Size (Magnitude of an Effect or the Strength of a Relationship)

Panel Data Analysis Fixed and Random Effects using Stata (v. 4.2)

Sociology 105: Research Design and Sociological Methods Spring 2014 Dr. Christopher Sullivan

Teaching model: C1 a. General background: 50% b. Theory-into-practice/developmental 50% knowledge-building: c. Guided academic activities:

Brown University Department of Economics Spring 2015 ECON 1620-S01 Introduction to Econometrics Course Syllabus

Getting Started in Data Analysis using Stata

DBA Courses and Sequence (2015-)

From the help desk: Swamy s random-coefficients model

A Short Guide to R with RStudio

POL 451: Statistical Methods in Political Science

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE. School of Mathematical Sciences

SYLLABUS. OFFICE AND HOURS: Karnoutsos 536 (Access through K506) M 12, T 1, R 10, 12, 2 or by appointment. I am available by at all times.

Master Black Belt TRAINING REQUIREMENTS

The Probit Link Function in Generalized Linear Models for Data Mining Applications

Association Between Variables

Getting Started in R~Stata Notes on Exploring Data

UNDERSTANDING THE DEPENDENT-SAMPLES t TEST

Econometrics and Data Analysis I

Using stargazer to report regression output and descriptive statistics in R (for non-latex users) (v1.0 draft)

UNDERSTANDING THE INDEPENDENT-SAMPLES t TEST

TABLE OF CONTENTS. About Chi Squares What is a CHI SQUARE? Chi Squares Hypothesis Testing with Chi Squares... 2

William Paterson University of New Jersey Department of Computer Science College of Science and Health Course Outline

Executive Master of Public Administration. QUANTITATIVE TECHNIQUES I For Policy Making and Administration U6310, Sec. 03

Multiple Imputation for Missing Data: A Cautionary Tale

Is There an Insider Advantage in Getting Tenure?

Standard errors of marginal effects in the heteroskedastic probit model

Statistical Methods for research in International Relations and Comparative Politics

Syllabus: Environmental Economics

PARALLEL LINES ASSUMPTION IN ORDINAL LOGISTIC REGRESSION AND ANALYSIS APPROACHES

Total Credits: 32 credits are required for master s program graduates and 53 credits for undergraduate program.

Course description Analysis of Survey Data, 7.5 ECTS-credits, advanced level

California University of Pennsylvania Guidelines for New Course Proposals University Course Syllabus Approved: 2/4/13. Department of Psychology

for an appointment,

Business Analytics (AQM5201)

Introduction to Data Analysis in Hierarchical Linear Models

COURSE OUTLINE OLG 611: STRATEGIC HUMAN RESOURCE MANAGEMENT THE OPEN UNIVERSITY OF TANZANIA FACULTY OF BUINESS MANAGEMENT MASTER OF PROJECT MANAGEMENT

SPPH 501 Analysis of Longitudinal & Correlated Data September, 2012

Methods to Assist in Teaching Planning and Scheduling

Marketing Management and Management Ethics

Imputing Attendance Data in a Longitudinal Multilevel Panel Data Set

CURRICULUM PROPOSAL. Activities To carry out The Plan

An Evaluation of Entrepreneurship Education Programme in Kenya

Data Cleaning and Missing Data Analysis

ECON 523 Applied Econometrics I /Masters Level American University, Spring Description of the course

Using outreg2 to report regression output, descriptive statistics, frequencies and basic crosstabulations (v1.6 draft)

James E. Bartlett, II is Assistant Professor, Department of Business Education and Office Administration, Ball State University, Muncie, Indiana.

Python Lists and Loops

EPS 625 INTERMEDIATE STATISTICS FRIEDMAN TEST

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE. School of Mathematical Sciences

MODULE SPECIFICATION FORM. BUS748 Cost Centre: GAMP JACS2 code*: N211. Strategic Thinking and Effecting Change. Level: 7 Credit Value: 20

Finance. Corporate Finance. Additional information: See Moodle. Investments

City University of Hong Kong

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

UNDERSTANDING ANALYSIS OF COVARIANCE (ANCOVA)

Dataset Management with Microsoft Access

The Leadership Qualities of a Successful Municipal CAO

Family economics data: total family income, expenditures, debt status for 50 families in two cohorts (A and B), annual records from

USING SEASONAL AND CYCLICAL COMPONENTS IN LEAST SQUARES FORECASTING MODELS

On Marginal Effects in Semiparametric Censored Regression Models

Understanding Power and Rules of Thumb for Determining Sample Sizes

POINT OF SALE FRAUD PREVENTION

MBA and M.Sc. Courses

GEOG Remote Sensing

Assessing Project Management as an Academic Learning Outcome (ALO)

Cost implications of no-fault automobile insurance. By: Joseph E. Johnson, George B. Flanigan, and Daniel T. Winkler

Metrics for Managers: Big Data and Better Answers. Fall Course Syllabus DRAFT. Faculty: Professor Joseph Doyle E

MED RESEARCH IN MATHEMATICS EDUCATION SPRING, 2006

Data Quality Assessment

PROPOSAL FOR A GRADUATE CERTIFICATE IN SOCIAL SCIENCE METHODOLOGY

Evaluating Different Methods of Delivering Course Material: An Experiment in Distance Learning Using an Accounting Principles Course

WILLIAM PATERSON UNIVERSITY COLLEGE OF SCIENCE AND HEALTH DEPARTMENT OF NURSING COURSE SYLLABUS

Additional sources Compilation of sources:

Using the Delta Method to Construct Confidence Intervals for Predicted Probabilities, Rates, and Discrete Changes

Economics 386 Introduction to Forecasting in Economics and Finance Spring 2016 Mondays and Wednesdays 01:10 PM-02:30 PM in PAC 125

STOCK MARKET VOLATILITY AND REGIME SHIFTS IN RETURNS

Form 2B City University of Hong Kong

Time-Series Regression and Generalized Least Squares in R

The American Criminal Justice Association Lambda Alpha Epsilon Region III 2014 Fall Conference October 9-12, 2014 Kansas City, Missouri Robbers we

Total Minutes 1 Strategic Human Resource Management

Transcription:

Merge/Append using R (draft) Oscar Torres-Reyna Data Consultant otorres@princeton.edu January, 2011 http://dss.princeton.edu/training/

Intro Merge adds variables to a dataset. This document will use merge function. Merging two datasets require that both have at least one variable in common (either string or numeric). If string make sure the categories have the same spelling (i.e. country names, etc.). Explore each dataset separately before merging. Make sure to use all possible common variables (for example, if merging two panel datasets you will need country and years). Append adds cases/observations to a dataset. This document will use the rbind function. Appending two datasets require that both have variables with exactly the same name. If using categorical data make sure the categories on both datasets refer to exactly the same thing (i.e. 1 Agree, 2 Disagree, 3 DK on both). PU/DSS/OTR 2

mydata1 MERGE EXAMPLE 1 mydata2 mydata <- merge(mydata1, mydata2, by=c("country","year")) edit(mydata) PU/DSS/OTR 3

MERGE EXAMPLE 2 (one dataset missing a country) mydata1 mydata3 Merge merges only common cases to both datasets mydata <- merge(mydata1, mydata3, by=c("country","year")) edit(mydata) PU/DSS/OTR 4

mydata1 MERGE EXAMPLE 2 (cont.) including all data from both datasets mydata3 Adding the option all=true includes all cases from both datasets. mydata <- merge(mydata1, mydata3, by=c("country","year"), all=true) edit(mydata) PU/DSS/OTR 5

mydata1 MERGE EXAMPLE 3 (many to one) mydata4 mydata <- merge(mydata1, mydata4, by=c("country")) edit(mydata) PU/DSS/OTR 6

mydata1 MERGE EXAMPLE 4 (common ids have different name) mydata5 When common ids have different names use by.x and by.y to match them. R will keep the name of the first dataset (by.x) mydata <- merge(mydata1, mydata5, by.x=c("country","year"), by.y=c("nations","time")) edit(mydata) PU/DSS/OTR 7

mydata1 MERGE EXAMPLE 5 (different variables, same name) mydata6 When common ids have different names use by.x and by.y to match them. R will keep the name of the first dataset (by.x) When different variables from two different dataset have the same name, R will assign a suffix.x or.y to make them unique and to identify which dataset they are coming from. mydata <- merge(mydata1, mydata6, by.x=c("country","year"), by.y=c("nations","time")) edit(mydata) PU/DSS/OTR 8

APPEND PU/DSS/OTR 9

APPEND EXAMPLE 1 mydata7 mydata <- rbind(mydata7, mydata8) edit(mydata) mydata8 PU/DSS/OTR 10

APPEND EXAMPLE 1 (cont.) sorting by country/year Notice the square brackets and parenthesis attach(mydata) mydata_sorted <- mydata[order(country, year),] detach(mydata) edit(mydata_sorted) mydata_sorted PU/DSS/OTR 11

mydata7 APPEND EXAMPLE 2 one dataset missing one variable mydata9 If one variable is missing in one dataset you will get an error message mydata <- rbind(mydata7, mydata9) Error in rbind(deparse.level,...) : numbers of columns of arguments do not match Possible solutions: Option A) Drop the extra variable from one of the datasets (in this case mydata7) mydata7$x3 <- NULL Option B) Create the variable with missing values in the incomplete dataset (in this case mydata9) mydata9$x3 <- NA Run the rbind-- function again. PU/DSS/OTR 12

References/Useful links Main references for this document: UCLA R class notes: http://www.ats.ucla.edu/stat/r/notes/managing.htm Quick-R: http://www.statmethods.net/management/merging.html DSS Online Training Section http://dss.princeton.edu/training/ Princeton DSS Libguides http://libguides.princeton.edu/dss John Fox s site http://socserv.mcmaster.ca/jfox/ Quick-R http://www.statmethods.net/ UCLA Resources to learn and use R http://www.ats.ucla.edu/stat/r/ DSS - R http://dss.princeton.edu/online_help/stats_packages/r PU/DSS/OTR 13

References/Recommended books An R Companion to Applied Regression, Second Edition / John Fox, Sanford Weisberg, Sage Publications, 2011 Data Manipulation with R / Phil Spector, Springer, 2008 Applied Econometrics with R / Christian Kleiber, Achim Zeileis, Springer, 2008 Introductory Statistics with R / Peter Dalgaard, Springer, 2008 Complex Surveys. A guide to Analysis Using R / Thomas Lumley, Wiley, 2010 Applied Regression Analysis and Generalized Linear Models / John Fox, Sage, 2008 R for Stata Users / Robert A. Muenchen, Joseph Hilbe, Springer, 2010 Introduction to econometrics / James H. Stock, Mark W. Watson. 2nd ed., Boston: Pearson Addison Wesley, 2007. Data analysis using regression and multilevel/hierarchical models / Andrew Gelman, Jennifer Hill. Cambridge ; New York : Cambridge University Press, 2007. Econometric analysis / William H. Greene. 6th ed., Upper Saddle River, N.J. : Prentice Hall, 2008. Designing Social Inquiry: Scientific Inference in Qualitative Research / Gary King, Robert O. Keohane, Sidney Verba, Princeton University Press, 1994. Unifying Political Methodology: The Likelihood Theory of Statistical Inference / Gary King, Cambridge University Press, 1989 Statistical Analysis: an interdisciplinary introduction to univariate & multivariate methods / Sam Kachigan, New York : Radius Press, c1986 Statistics with Stata (updated for version 9) / Lawrence Hamilton, Thomson Books/Cole, 2006 PU/DSS/OTR 14