
Dataframes
Lecture 8, Nicholas Christian, BIOST 2094, Spring 2011

Outline
1. Importing and exporting data
2. Tools for preparing and cleaning datasets
   - Sorting
   - Duplicates
   - First entry
   - Merging
   - Reshaping
   - Missing values

Dataframes
R refers to datasets as dataframes. A dataframe is a matrix-like structure whose columns can be of different types. You can also think of a dataframe as a list: each column is an element of the list, and every element has the same length. To get/replace elements of a dataframe use either [ ] or $. The [ ] operator accesses rows and columns, and $ extracts an entire column.

# state.x77 is a built-in R dataset of state facts stored as a matrix
# Type data() to see a list of built-in datasets
data <- data.frame(state.x77)     # First, convert to a dataframe
head(data)                        # Print the first few rows of a dataset/matrix
tail(data)                        # Print the last few rows of a dataset/matrix
names(data)                       # Column names
colnames(data); rownames(data)    # Column and row names
dim(data)                         # Dimension of the dataframe
data[, c("Population", "Income")] # "Population" and "Income" columns
data$Area                         # Get the column "Area"
data[1:5, ]                       # Get the first five rows
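Since [ ] and $ also work on the left-hand side of an assignment, the same syntax replaces values. A minimal sketch, continuing with the dataframe above (the column name score is made up for illustration):

data$score <- rnorm(nrow(data))  # Add a hypothetical new column
data[1, "Income"] <- 5000        # Replace a single cell
data$score <- NULL               # Remove the hypothetical column again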

Importing Data from the Keyboard
The scan() function is used to type or paste a few numbers into a vector via the keyboard. Press Enter after typing each number and press Enter twice to stop entering data. You can also paste multiple numbers at each prompt. scan() can also be used to import datasets. It is a very flexible function, but it is also harder to use; the function read.table() provides an easier interface.

> x <- scan()
1: 4
2: 6
3: 10
4: 11
5:
Read 4 items
> x
[1] 4 6 10 11

Importing Data from Files
The function read.table() is the easiest way to import data into R. The preferred raw data format is either a tab-delimited or a comma-separated (CSV) file. The simplest and recommended way to import Excel files is to do a Save As in Excel, save the file as a tab-delimited or CSV file, and then import this file into R. Similarly, for SAS files export the file as a tab-delimited or CSV file using proc export.

Functions for importing data:

read.table()  Reads a file in table format and creates a dataframe
read.csv()    Same as read.table() with sep="," and header=TRUE
read.fwf()    Reads a table of fixed-width formatted data. The data are not
              delimited, so you need to specify the width of each field.
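A minimal read.fwf() sketch, assuming a hypothetical file fixed.txt whose records are a 3-character ID followed by a 2-digit age (the file name and column names are made up for illustration):

# fixed.txt contains lines such as "ID142", i.e. ID "ID1" and age 42
data <- read.fwf("fixed.txt", widths=c(3, 2), col.names=c("id", "age"))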

read.table()

read.table(file, header = FALSE, sep = "", skip, as.is, stringsAsFactors = TRUE)

file              The name of the file to import
header            Logical; does the first row contain column labels?
sep               Field separator character:
                    sep=""   whitespace (default)
                    sep="\t" tab-delimited
                    sep=","  comma-separated
skip              Number of lines to skip before reading data
as.is             Vector of numeric or character indices specifying which
                  columns should not be converted to factors
stringsAsFactors  Logical; should character vectors be converted to factors?

By default R converts character string variables to factor variables; use stringsAsFactors or as.is to change the default. There are many more arguments to read.table() that allow you to adjust for the format of your data.

read.table()
The default field separator is whitespace. If you use this separator, keep in mind that:
- Column names cannot contain spaces
- Character data cannot contain spaces
- Extra blank spaces can be misinterpreted as missing data, so missing data need to be included explicitly in the dataset as NA

The recommended approach is to use commas to separate the fields. By importing a CSV file we avoid the problems that can occur when you use a space. To avoid typing the file path, either set the working directory or use file.choose().
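To illustrate the explicit-NA requirement, a small sketch assuming a hypothetical whitespace-delimited file heights.txt:

# heights.txt contains, for example:
#   name height
#   ann  64
#   bob  NA     <- a missing height must be written as NA, not left blank
data <- read.table("heights.txt", header=TRUE)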

Example - read.table()

data <- read.table("example.csv", header=TRUE, sep=",", skip=7)
str(data)  # Gives the structure of data

# Have all character strings not be treated as factor variables
data <- read.table("example.csv", header=TRUE, sep=",", skip=7,
                   stringsAsFactors=FALSE)
str(data)

# Have character strings in selected columns not be treated as factors
data <- read.table("example.csv", header=TRUE, sep=",", skip=7, as.is="dr")
str(data)

# Have the start date be a Date object
data <- read.table("example.csv", header=TRUE, sep=",", skip=7, as.is="dr")
Start.Date <- as.Date(data$d.start, "%m/%d/%y")
cbind(data[,-7], Start.Date)

# Load data using file.choose()
data <- read.table(file.choose(), header=TRUE, sep=",", skip=7)

Attaching a Dataframe
The function search() displays the search path for R objects. When R looks for an object it first looks in the global environment, then proceeds through the search path looking for the object. The search path lists attached dataframes and loaded libraries. The function attach() (detach()) attaches (detaches) a dataframe to the search path. This means that the column names of the dataframe are searched by R when evaluating a variable, so variables in the dataframe can be accessed by simply giving their names.

head(CO2)         # CO2 dataset, carbon dioxide uptake in grass plants
mean(CO2$uptake)  # Average uptake
search()          # Search path
attach(CO2)       # Attach dataset
search()
mean(uptake)      # If R cannot find the variable in the global environment,
                  # it looks in CO2; average uptake
detach(CO2)       # Detach dataset
search()

Caution with Attaching a Dataframe
attach() is fine if you are just working on one dataset and your purpose is mostly analysis, but if you are going to have several datasets and lots of variables, avoid using attach().
- If you attach a dataframe and use simple names like x and y, it is quite possible to have very different objects with the same name, which can cause problems.
- R prints a warning message if attaching a dataframe causes a duplication of one or more names.
- Several modeling functions, like lm() and glm(), have a data argument, so there is no need to attach a dataframe.
- For functions without a data argument use with(). This function evaluates an R expression in an environment constructed from the dataframe.
- If you do use attach(), call detach() when you are finished working with the dataframe to avoid errors.

Example - Caution with attach()

# Type defined in the global environment
Type <- gl(2, 5, labels=c("Type A", "Type B"))
attach(CO2)  # Attach CO2
Type         # Gets Type from the global environment (first in search())
             # instead of from CO2
detach(CO2)

# Use with() instead of attach(); it can also be simpler than using $
with(CO2, table(Type, Plant))
with(CO2, boxplot(uptake ~ Treatment))

# Some modeling functions have a data argument
lm(uptake ~ Treatment, data=CO2)

Importing Datasets Using scan()
The scan() function is very flexible, but as a result it is also harder to use than read.table(). However, scan() can be useful when you need to import odd datasets. For example, suppose we want to input a dataset that has a different number of values on each line.

# Determine the number of lines, by using the new-line character
# as a field separator (what="" reads each line as a single string)
lines <- length(scan("odddata.txt", what="", sep="\n"))

# Read one line at a time, skipping to the current line
sapply(0:(lines-1), function(x) scan("odddata.txt", skip=x, nlines=1))

Exporting Datasets

write.table(x, file="", sep=" ", row.names=TRUE, col.names=TRUE)

x          The object to be saved, either a matrix or dataframe
file       File name
sep        Field separator
row.names  Logical; include row names
col.names  Logical; include column names

There is also a wrapper function, write.csv(), which creates a CSV file by calling write.table() with sep=",".

# Export an R dataframe as a CSV file
write.table(data, "export.example.csv", sep=",", row.names=FALSE)
write.table(data, file.choose(new=TRUE), sep=",", row.names=FALSE)
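For comparison, the write.csv() wrapper produces the same file with one fewer argument; a sketch assuming the same data object as above:

write.csv(data, "export.example.csv", row.names=FALSE)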

Exporting Console Output
The function capture.output() evaluates its argument and either returns the output as a character vector or sends it to a file. Use sink() to divert all output to a file.

x <- rnorm(100)
y <- rnorm(100)
fit <- lm(y ~ x)
capture.output(summary(fit))
capture.output(summary(fit), file="lm.out.txt")

# Send all output to a file (not expressions)
sink("lm.sink.txt")  # Start sink
mean(x)
summary(fit)
sink()               # Stop sink

Sorting
Sort a dataframe by row using order() and [ ].

attach(CO2)  # Since we will only be dealing with variables in CO2

# order() returns a vector of positions that correspond to the
# argument vector, rearranged so that the elements are sorted
order(uptake)
uptake[order(uptake)]  # Same as sort(uptake)

# Sort CO2 by uptake, ascending
CO2[order(uptake),]

# Sort CO2 by uptake, descending
CO2[rev(order(uptake)),]
CO2[order(uptake, decreasing=TRUE),]

# Sort CO2 by conc then uptake
CO2[order(conc, uptake),]

# Sort CO2 by conc (ascending) then uptake (descending);
# the minus sign only works for numeric data
CO2[order(conc, -uptake),]

# Sort CO2 by Plant, an ordered factor variable
CO2[order(Plant),]

# Sort CO2 by Plant (descending) then uptake (ascending)
# rank() converts Plant to numeric
CO2[order(-rank(Plant), uptake),]

Duplicates
The function unique() returns a dataframe with the duplicate rows removed; use the incomparables argument to specify values that should not be compared. duplicated() returns a logical vector indicating which rows are duplicates. Both functions work on any dataframe, whether it was imported or created during the R session.

# Load data
data <- read.table("example.csv", header=TRUE, sep=",", skip=8, as.is="dr")
Start.Date <- as.Date(data$d.start, "%m/%d/%y")
data <- cbind(data[,-7], Start.Date)

unique(data)             # Dataset with duplicate rows removed
data[duplicated(data),]  # Duplicate rows
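The incomparables argument is easiest to see on a plain vector (it is not fully supported for dataframes); a small sketch:

x <- c(1, 2, NA, NA, 2)
duplicated(x)                    # FALSE FALSE FALSE  TRUE  TRUE
duplicated(x, incomparables=NA)  # NA is never marked as a duplicate:
                                 # FALSE FALSE FALSE FALSE  TRUE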

Matching
The match() function can be used to select the first entry of a variable.

# Load data
data <- read.table("example.csv", header=TRUE, sep=",", skip=7, as.is="dr")
visit.date <- as.Date(data$visit, "%m/%d/%y")
data <- cbind(data[,-2], visit.date)

# Remove duplicates
data.nodup <- unique(data)

# Sort by ID then visit date so that the first entry for each patient
# is the first visit
data.nodup.sort <- with(data.nodup, data.nodup[order(ID, visit.date),])

# Get the first visit of each patient
first <- with(data.nodup.sort, match(unique(ID), ID))
data.nodup.sort[first,]
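match(x, table) returns the position in table of the first occurrence of each element of x, which is why it picks out first entries. A small self-contained illustration:

ids <- c("a", "b", "a", "c", "b")
match(unique(ids), ids)  # 1 2 4: first positions of "a", "b" and "c"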

Merging
Use merge() to merge two (and only two) dataframes at a time.

# Merge two datasets by an ID variable, where ID is the same for both datasets
data1 <- data.frame(id=1:5, x=letters[1:5])
data2 <- data.frame(id=1:5, y=letters[6:10])
merge(data1, data2)

# Merge two datasets by an ID variable, where ID is not the same for both datasets
data1 <- data.frame(id=1:5, x=letters[1:5])
data2 <- data.frame(id=4:8, y=letters[6:10])
merge(data1, data2)
merge(data1, data2, all=TRUE)
merge(data1, data2, all.x=TRUE)  # Only keep the rows from the 1st argument, data1
merge(data1, data2, all.y=TRUE)  # Only keep the rows from the 2nd argument, data2

# Merge two datasets by an ID variable, where both datasets have the same names
data1 <- data.frame(id=1:5, x=letters[1:5])
data2 <- data.frame(id=1:5, x=letters[6:10])
merge(data1, data2, all=TRUE)                   # Add rows
merge(data1, data2, by="id")                    # Add columns
merge(data1, data2, by="id", suffixes=c(1, 2))

# Merge two datasets by an ID variable, where the ID variable has a different name
data1 <- data.frame(id1=1:5, x=letters[1:5])
data2 <- data.frame(id2=1:5, x=letters[6:10])
merge(data1, data2, by.x="id1", by.y="id2")
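To combine more than two dataframes, apply merge() repeatedly, for example with Reduce(); a short sketch using three small made-up dataframes:

data1 <- data.frame(id=1:3, x=letters[1:3])
data2 <- data.frame(id=1:3, y=letters[4:6])
data3 <- data.frame(id=1:3, z=letters[7:9])
# Merge pairwise, left to right, on the common "id" column
Reduce(function(a, b) merge(a, b, by="id"), list(data1, data2, data3))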

Reshaping
Use reshape() to transform data from wide format to long format and vice versa.

# Reshape a wide dataset into long format
# The variable names t.1 and t.2 are intentional; the ".1" and ".2"
# suffixes make the reshaping much easier
wide <- data.frame(id=1:5, x=c(10,20,30,40,50),
                   t.1=letters[1:5], t.2=letters[22:26])
reshape(wide, varying=c("t.1", "t.2"), direction="long")
reshape.long <- reshape(wide, varying=c("t.1", "t.2"), direction="long",
                        timevar="visit", idvar="index", drop="id")

# A reshaped dataset contains attributes that make it easy to move
# between long and wide format after the first reshape
reshape(reshape.long, direction="wide")

# Reshape a long dataset into wide format
long <- data.frame(id=rep(1:5, 2), x=rep(c(10,20,30,40,50), 2),
                   t=c(letters[1:5], letters[22:26]), time=rep(1:2, each=5))
reshape(long, idvar="id", v.names="t", timevar="time", direction="wide")
reshape.wide <- reshape(long, idvar="id", v.names=c("t", "x"),
                        timevar="time", direction="wide")

# Easily go back to long format
reshape(reshape.wide, direction="long")

Stacking/Unstacking
The stack() function concatenates multiple vectors from separate columns of a dataframe into a single vector, along with a factor indicating where each observation originated; unstack() reverses this operation.

data <- data.frame(label=rep(c("A", "B", "C"), 3), value=sample(9))

# First argument is an object to be stacked or unstacked
# Second argument is a formula: "values to unstack" ~ "groups to create"
uns <- unstack(data, value ~ label)

# Re-stack
stack(uns)

# Select which columns to stack
data <- data.frame(A=sample(1:9, 3), B=sample(1:9, 3), C=sample(1:9, 3))
stack(data, select=c("A", "B"))

Missing Data
Use na.omit() to remove missing data from a dataset and use na.fail() to signal an error if a dataset contains NA. complete.cases() returns a logical vector indicating which rows have no missing data.

data <- data.frame(x=c(1,2,3), y=c(5, NA, 8))
na.omit(data)  # Remove all rows with missing data

# Use na.fail() to test whether a dataset is complete
NeedCompleteData <- function(data) {
  na.fail(data)  # Signals an error if there is missing data
  lm(y ~ x, data=data)
}
NeedCompleteData(data)

sum(complete.cases(data))   # Get the number of complete cases
sum(!complete.cases(data))  # Get the number of incomplete cases
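complete.cases() can also subset the dataframe directly, which gives the same rows as na.omit() here:

data[complete.cases(data),]  # Same result as na.omit(data)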