Getting Started with R and RStudio

1 What is R?

R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following reasons:

1. R is free and open-source.
2. R is user-extensible and user extensions can easily be made available to all users.
3. R is commercial quality. It is the package of choice for many engineers who use statistics frequently.
4. R is easy to use.

No doubt you will hear some disagreement about point 4 above. Other data analysis tools (Excel, for example) appear easier to use at first. But many things that an engineer might do are easier to do in R than in Excel, and some are impossible to do correctly in Excel. In Mathematics 241 we will focus on core statistical tools and so learn to use only a small fraction of the capabilities of R. But since R is free, you will be able to keep R and add to your knowledge of it throughout your career.

2 Using R on the Cloud

R can easily be downloaded and installed on your personal computer. However, we will use R over the internet by using a system called RStudio. There are advantages and disadvantages to using R over the internet, but the principal advantages for this course are that the installation and setup of R are taken care of and that data can easily be shared with the instructor and other students. (Instructions on how to download and set up R and RStudio on your own computer are available at the course webpage.)

To use RStudio on the web, go to http://dahl.calvin.edu:8787. Initially, your Calvin ID works as both your ID and password. (You can change your password by going to http://dahl.calvin.edu:4200 and entering the yppasswd command when you finally get a command prompt.) The webpage that comes up after logging in should show four panes.
R commands are entered one line at a time into the Console pane (the lower left pane by default, although that can be changed). In the standalone version of R that you can install, there is only a console window at the start. The other three panes are used by RStudio to interact with the file system, to show graphics plots, and to provide an editor that can compose input for the console. (These notes are being produced in RStudio using the pane at the top left as an editor.) In the remainder of these notes we will work entirely within the console. Thus these notes can be used to get started in any version of R.

The symbol > is the prompt symbol that signifies that R is ready for input. In general, in response to the prompt we enter a one-line command and get some output (or define some object). We will look at some of the basic kinds of objects and commands in the rest of these notes.

3 Basic features of R

In the examples that follow, you can distinguish input from output by the input prompt symbol >. Try these commands or variations of them yourself.

3.1 Using R as a Calculator

R can be used as a calculator.

> 5 + 3
[1] 8
> 15.3 * 23.4
[1] 358.02
> sqrt(16)
[1] 4

You can save values to named variables for later reuse.

> product = 15.3 * 23.4        # save result
> product                      # show the result
[1] 358.02
> .5 * product                 # half of the result
[1] 179.01
> log(product)                 # (natural) log of the result
[1] 5.880589
> product <- 15.3 * 23.4       # <- is the assignment operator, same as =
> 15.3 * 23.4 -> newproduct    # can also assign to a name on the right hand side
> newproduct
[1] 358.02

The semi-colon can be used to place multiple commands on one line. One frequent use of this is to save and print a value all in one go:

> product <- 15.3 * 23.4; product    # save result and show it
[1] 358.02

3.2 Functions and Objects

Though R does arithmetic on numbers, the real power of R comes from the fact that R understands complex objects and has a large library of functions that operate on those objects. So most of the R commands that we will enter will look like

f(x,y,...)

where f is the name of an R function (like log above) and x,y,... is a list of objects. In the next section we illustrate by introducing the vector object and giving some examples of functions that operate on vectors.
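The f(x,y,...) pattern described above can be sketched with a few functions from base R. The particular functions and values here are chosen only for illustration and are not part of the course notes.

```r
# Each call has the form f(x, y, ...): a function name followed by its arguments.
rounded    <- round(3.14159, 2)       # two positional arguments
log_base10 <- log(100, base = 10)     # one positional and one named argument
largest    <- max(3, 7, 5)            # a function that accepts any number of arguments

# Show the three results, using the semi-colon to put several commands on one line
rounded; log_base10; largest
```

Naming an argument, as in base = 10, tells R which of the function's arguments the value is meant for.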

3.3 Vectors

A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements of the vector must be of the same mode. Typically, we use a vector to store the values of a quantitative variable. Usually vectors will be constructed by reading data from an R dataset or a file, as we will soon see. But short vectors can be constructed by entering the elements directly.

> x = c(1,3,5,7,9,8,6,4,2)
> x
[1] 1 3 5 7 9 8 6 4 2

Note that the [1] that precedes the elements of the vector is not one of the elements but rather an indication that the first element of the vector follows. There are a couple of shortcuts that help construct vectors that are regular.

> y=1:10
> z=seq(0,5,.05)
> y;z
 [1]  1  2  3  4  5  6  7  8  9 10
  [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
 [16] 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45
 [31] 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20
 [46] 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95
 [61] 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45 3.50 3.55 3.60 3.65 3.70
 [76] 3.75 3.80 3.85 3.90 3.95 4.00 4.05 4.10 4.15 4.20 4.25 4.30 4.35 4.40 4.45
 [91] 4.50 4.55 4.60 4.65 4.70 4.75 4.80 4.85 4.90 4.95 5.00

Many functions operate on vectors component-wise.

> x=1:5
> y=6:10
> x^2
[1]  1  4  9 16 25
> x+y
[1]  7  9 11 13 15
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
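As a further sketch of the ideas in this section, the base R functions length() and mode() report the two properties of a vector mentioned above. The character values below are invented for illustration.

```r
x <- c(1, 3, 5, 7, 9, 8, 6, 4, 2)
length(x)                   # number of elements
mode(x)                     # "numeric"

words <- c("low", "high")   # a character vector (illustrative values)
mode(words)                 # "character"

# Mixing modes in c() does not give an error; R converts every element
# to one common mode so that the same-mode rule still holds.
mixed <- c(1, "two", 3)
mode(mixed)                 # "character"
mixed

# seq() with a named step size, followed by component-wise arithmetic
z <- seq(0, 1, by = 0.25)
z * 2
```

The silent conversion in the third example is worth remembering: a single mistyped entry in a column of numbers can turn the whole vector into character data.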

3.4 Data Frames

Data sets are usually stored in a special structure called a data frame. Data frames have a 2-dimensional structure. Rows correspond to the individuals (observational units, cases, subjects) of our data set and the columns correspond to variables (measurements collected on each individual). Data frames in R are named, as are the individual variables of the data frame. The columns (variables) are either vectors or factors (think of a factor as a vector that stores a categorical variable). We will usually get our data frames from external files that we have prepared in some other way. Excel is a good way to prepare a data frame, since a data frame looks like a spreadsheet.

Some datasets are included with the default R installation. One of them is the iris data frame, which contains 5 variables measured for each of 150 iris plants.

> str(iris)    # summarizes the structure of the data frame
'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> summary(iris)    # gives summary information on each variable
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  

> head(iris)    # prints the first several cases of the data frame
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In interactive mode, you can also try

> View(iris)

to see the data or

> ?iris

to get the documentation for the data set.
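To make the row and column structure concrete, here is a minimal sketch that builds a small data frame by hand with the base R function data.frame(). The data frame name plants and all of its values are hypothetical, invented only for this illustration.

```r
# Hypothetical measurements: each row is one plant, each column one variable.
plants <- data.frame(
  height  = c(12.1, 15.3, 9.8, 14.0),        # a numeric vector
  species = factor(c("a", "b", "a", "b"))    # a factor (categorical variable)
)

str(plants)      # structure: 4 observations of 2 variables
nrow(plants)     # number of rows (individuals)
ncol(plants)     # number of columns (variables)
names(plants)    # the variable names
```

The same functions (str, summary, head) that were applied to iris above work on any data frame, including one constructed this way.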

Access to an individual variable in a data frame uses the $ operator in the following syntax:

> dataframe$variable

For example,

> iris$Sepal.Length
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

shows the contents of the Sepal.Length variable. Note that R is case-sensitive, so the variable name must be typed exactly as it appears in the data frame. But printing all the values isn't very useful for a large data set. We would prefer to compute a numerical or graphical summary.

4 Summaries of a single quantitative variable

Almost always, a quantitative variable is stored in a vector and that vector is one column of a data frame. Most functions that give a numerical or graphical summary require a vector as argument. In this section we illustrate some of the more important summary functions with the variable Sepal.Length of the data frame iris.

4.1 Numerical summaries

> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8
> sd(iris$Sepal.Length)
[1] 0.8280661
> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100% 
 4.3  5.1  5.8  6.4  7.9 

4.2 Graphical Summaries

There are several ways to make graphs in R. Many individuals have written R packages that give great control over the way a graph is drawn. We will use the standard graphics functions that are built in to R. Here we illustrate the two most important graphical representations of a single quantitative variable. In RStudio, graphics output appears in the plot window (lower right). You must click on the Plots tab to see it. A histogram is drawn using the function hist.

> hist(iris$Sepal.Length)

[Figure: frequency histogram titled "Histogram of iris$Sepal.Length"; x-axis iris$Sepal.Length, y-axis Frequency.]

Many functions in R have optional arguments that change the way that the function acts. Often we can omit these arguments since R chooses reasonable default values. Note that R produces frequency histograms by default. To produce a density histogram, we need the optional argument freq. Note that we name the argument. Optional arguments usually have to be named so that R knows which arguments are being included. Other optional arguments control the title of the histogram and the axis labels.

> hist(iris$Sepal.Length, freq=F, main="Sepal Length", xlab=" ")    # F is short for FALSE

[Figure: density histogram titled "Sepal Length"; y-axis Density, x-axis unlabeled.]
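The effect of freq=FALSE can be checked numerically. The sketch below relies on the fact that hist() in base R returns, invisibly, an object whose components include breaks and density; the axis label and the suggested number of breaks are illustrative choices, not settings used elsewhere in these notes.

```r
# Draw a density histogram and keep the object that hist() returns.
h <- hist(iris$Sepal.Length,
          freq   = FALSE,             # density scale instead of counts
          breaks = 10,                # a suggested number of cells; R may adjust it
          main   = "Sepal Length",
          xlab   = "Length (cm)")

# In a density histogram the areas of the bars (height times width) add up to 1.
total_area <- sum(h$density * diff(h$breaks))
total_area
```

Writing FALSE in full is slightly safer than F, since F is an ordinary variable that can be reassigned while FALSE cannot.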

Another common plot is called a boxplot. A boxplot is a graphical representation of a five number summary of a quantitative variable. The default boxplot uses a vertical scale. Here we draw a horizontal boxplot.

> boxplot(iris$Sepal.Length, horizontal=T, main="Sepal Length")    # T is short for TRUE

[Figure: horizontal boxplot titled "Sepal Length".]

5 Importing data

In this class, we will use data from several different sources. R has many built-in datasets. (The iris dataset used earlier in these notes is one of those.) There are also many packages available that provide additional datasets and also extend R by defining useful functions. Packages are installed and loaded via the Package tab of the files pane of RStudio.

We will also use datasets developed especially for this class. Each RStudio user has space to save files. You can see your personal directory using the files tab of the same window in which you look at plots. Each user has a Public directory which is visible to other RStudio users. There are two collections of data that are available through the instructor's public directory. The directory Navidi contains the datasets from the textbook. Other datasets used in this course are also included there.

To load such datasets, use the Import Dataset tab of the Workspace pane, select From Text File and enter as filename /home/stob/data. You will see a listing of that directory. Class datasets are in this directory and textbook datasets are in the Navidi directory. For example, to import the dimes dataset, simply select dimes.csv. A window will pop up that enables you to tell R which format the data is in, but in this case RStudio understands the CSV format that the dimes dataset is in. This procedure defines the data frame dimes.

To load a textbook dataset, navigate to the Navidi folder and select the appropriate chapter and then the file in that chapter: for example, ex3-2-5.txt is the data for exercise 5 in section 2 of chapter 3. Note that you will have to change the variable name (from ex3-2-5) since dashes are not acceptable characters in variable names. Choose a short, memorable variable name!

6 Useful features of RStudio

One of the most useful features of RStudio is that it will save the state of your session even if you close your browser. This includes all variables, plots, and other settings. This is very useful for class work since you might get stuck on a homework problem and can pick up again after you get help in class or from the instructor.

Another useful feature is the History tab of the upper right hand window. In that window you can find all the lines that you have entered into the console. These lines can be copied into the console window, for example.

The Source pane can be used to edit and save your work. You can run command lines entered into this pane using the appropriate buttons. If you have an error, you can simply edit that line in the Source pane.

Two useful editing features are accessed by the tab key and the arrow keys. If you start to type the name of a function (e.g., > hi) and press the tab key, you get all the possible functions that begin with these characters along with a short description of what they do. (Try pressing the tab key after entering hi.) If you hit the up-arrow key, the previous line that you entered becomes the current line and you can edit it and enter it again. This is very useful if you make a small typo on a very long line.
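The menu-driven import described in section 5 has a console equivalent in the base R function read.csv(). The sketch below is only an illustration: it writes a tiny CSV file to a temporary location so that it runs anywhere, and the file contents, the column names, and the data frame name coins are all invented. On the course server one would presumably pass the path of a real file instead, which is an assumption and not something these notes state.

```r
# Create a small CSV file in a temporary location (illustrative data only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("year,mass",
             "1990,2.27",
             "2000,2.25",
             "2010,2.26"),
           csv_path)

# read.csv() returns a data frame. The name on the left is our choice,
# so pick one that is short and contains no dashes.
coins <- read.csv(csv_path)

str(coins)          # check that the columns were read as expected
mean(coins$mass)    # summarize one variable of the imported data frame
```

Typing the import as a command has the advantage that it appears in the History tab and can be saved in the Source pane along with the rest of an analysis.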