Getting Started with R and RStudio

system, to show graphics plots, and to provide an editor that can compose input for the console. (These notes are being produced in RStudio using the pane at the top left as an editor.)

In the remainder of these notes we will work entirely within the console, so the notes can be used to get started in any version of R. The symbol > is the prompt, which signals that R is ready for input. In general, in response to the prompt we enter a one-line command and get some output (or define some object). We will look at some of the basic kinds of objects and commands in the rest of these notes.

3 Basic features of R

In the examples that follow, you can distinguish input from output by the input prompt symbol >. Try these commands or variations of them yourself.

3.1 Using R as a Calculator

R can be used as a calculator.

> 15.3 * 23.4
[1] 358.02
> sqrt(16)
[1] 4

You can save values to named variables for later reuse.

> product = 15.3 * 23.4        # save result
> product                      # show the result
[1] 358.02
> .5 * product                 # half of the result
[1] 179.01
> log(product)                 # (natural) log of the result
[1] 5.880589
> product <- 15.3 * 23.4       # <- is the assignment operator, same as =
> 15.3 * 23.4 -> newproduct    # can also assign to the right-hand side
> newproduct
[1] 358.02

The semicolon can be used to place multiple commands on one line. One frequent use of this is to save and print a value all in one go:

> product <- 15.3 * 23.4; product    # save result and show it
[1] 358.02

3.2 Functions and Objects

Though R does arithmetic on numbers, the real power of R comes from the fact that R understands complex objects and has a large library of functions that operate on those objects. So most of the R commands that we will enter will look like f(x,y,...) where f is the name of an R function (like log above) and x,y,... is a list of objects. In the next section we illustrate by introducing the vector object and giving some examples of functions that operate on vectors.
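As a small sketch of the f(x,y,...) pattern described above, the lines below call a few built-in functions with one or more arguments. The variable names (r, area, and so on) are made up for illustration and do not come from these notes.

```r
# Hypothetical example: the names r, area, rounded, biggest, root are illustrative.
r <- 2.5                      # assign with <-
area = pi * r^2               # = works the same way at the prompt
rounded <- round(area, 2)     # f(x, y): round area to 2 decimal places
biggest <- max(3, 8, 5)       # f(x, y, ...): any number of arguments
root <- sqrt(16)              # f(x): a single argument
rounded; biggest; root        # semicolons put several commands on one line
```

Each call has the same shape: a function name followed by a parenthesized, comma-separated list of objects.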

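3.3 Vectors

A vector has a length (a non-negative integer) and a mode (numeric, character, complex, or logical). All elements of a vector must be of the same mode. Typically, we use a vector to store the values of a quantitative variable. Usually vectors will be constructed by reading data from an R dataset or a file, as we will soon see. But short vectors can be constructed by entering the elements directly.

> x = c(1,3,5,7,9,8,6,4,2)
> x
[1] 1 3 5 7 9 8 6 4 2

Note that the [1] that precedes the elements of the vector is not one of the elements but rather an indication that the first element of the vector follows. There are a couple of shortcuts that help construct vectors that are regular.

> y=1:10
> z=seq(0,5,.05)
> y;z
 [1]  1  2  3  4  5  6  7  8  9 10
  [1] 0.00 0.05 0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45 0.50 0.55 0.60 0.65 0.70
 [16] 0.75 0.80 0.85 0.90 0.95 1.00 1.05 1.10 1.15 1.20 1.25 1.30 1.35 1.40 1.45
 [31] 1.50 1.55 1.60 1.65 1.70 1.75 1.80 1.85 1.90 1.95 2.00 2.05 2.10 2.15 2.20
 [46] 2.25 2.30 2.35 2.40 2.45 2.50 2.55 2.60 2.65 2.70 2.75 2.80 2.85 2.90 2.95
 [61] 3.00 3.05 3.10 3.15 3.20 3.25 3.30 3.35 3.40 3.45 3.50 3.55 3.60 3.65 3.70
 [76] 3.75 3.80 3.85 3.90 3.95 4.00 4.05 4.10 4.15 4.20 4.25 4.30 4.35 4.40 4.45
 [91] 4.50 4.55 4.60 4.65 4.70 4.75 4.80 4.85 4.90 4.95 5.00

Many functions operate on vectors component-wise.

> x=1:5
> y=6:10
> x^2
[1]  1  4  9 16 25
> x+y
[1]  7  9 11 13 15
> log(x)
[1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379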
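The following sketch pulls together the ideas of length, mode, and component-wise operation. The data values and the names heights, idx, centered, and mixed are invented for illustration.

```r
# Illustrative vectors; all names and values here are assumptions for this sketch.
heights <- c(150, 162, 171, 168, 180)   # enter elements directly
idx <- seq(1, 9, by = 2)                # regular sequence: 1 3 5 7 9
n <- length(heights)                    # number of elements
m <- mode(heights)                      # "numeric" for this vector
centered <- heights - mean(heights)     # arithmetic is applied element by element
mixed <- c(1, "a", TRUE)                # mixing modes coerces every element to character
mode(mixed)
```

The last two lines show what "all elements must be of the same mode" means in practice: R does not reject the mixed input but converts everything to a single common mode.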

3.4 Data Frames

Data sets are usually stored in a special structure called a data frame. Data frames have a 2-dimensional structure. Rows correspond to the individuals (observational units, cases, subjects) of our data set and the columns correspond to variables (measurements collected on each individual). Data frames in R are named, as are the individual variables of the data frame. The columns (variables) are either vectors or factors (think of a factor as a vector that stores a categorical variable).

We will usually get our data frames from external files that we have prepared in some other way. Excel is a good way to prepare a data frame, since a data frame looks like a spreadsheet.

Some datasets are included with the default R installation. One of them is the iris data frame, which contains 5 variables measured for each of 150 iris plants.

> str(iris)      # summarizes the structure of the data frame
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

> summary(iris)  # gives summary information on each variable
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50

> head(iris)     # prints the first several cases of the data frame
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

In interactive mode, you can also try

> View(iris)

to see the data or

> ?iris

to get the documentation for the data set.
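To make the row/column structure concrete, here is a minimal sketch that builds a tiny data frame by hand and inspects it with the same functions used on iris. The data frame plants and its variables height and kind are hypothetical.

```r
# A tiny made-up data frame; the names plants, height, and kind are hypothetical.
plants <- data.frame(
  height = c(12.1, 15.3, 9.8, 14.0),                  # quantitative variable (numeric vector)
  kind   = factor(c("fern", "moss", "fern", "moss"))  # categorical variable (factor)
)
str(plants)        # structure: 4 obs. of 2 variables
summary(plants)    # per-variable summaries
head(plants, 2)    # first two cases
dim(plants)        # number of rows (individuals) and columns (variables)
```

Each row is one individual and each column is one variable, which is the layout a spreadsheet prepared for import should follow.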

Access to an individual variable in a data frame uses the $ operator in the following syntax:

> dataframe$variable

For example,

> iris$Sepal.Length
[output not reproduced here: the 150 values of the variable]

shows the contents of the Sepal.Length variable. (Names in R are case-sensitive, so the variable name must be typed exactly as Sepal.Length.) But this isn't very useful for a large data set. We would prefer to compute a numerical or graphical summary.

4 Summaries of a single quantitative variable

Almost always, a quantitative variable is stored in a vector and that vector is one column of a data frame. Most functions that give a numerical or graphical summary require a vector as argument. In this section we illustrate some of the more important summary functions with the variable Sepal.Length of the data frame iris.

4.1 Numerical summaries

> mean(iris$Sepal.Length)
[1] 5.843333
> median(iris$Sepal.Length)
[1] 5.8
> sd(iris$Sepal.Length)
[1] 0.8280661
> quantile(iris$Sepal.Length)
  0%  25%  50%  75% 100%
 4.3  5.1  5.8  6.4  7.9

4.2 Graphical Summaries

There are several ways to make graphs in R. Many individuals have written R packages that give great control over the way a graph is drawn. We will use the standard graphics functions that are built in to R. Here we illustrate the two most important graphical representations of a single quantitative variable. In RStudio, graphics output appears in the plot window (lower right). You must click on the Plots tab to see it. A histogram is drawn using the function hist.

> hist(iris$Sepal.Length)
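The numerical summaries of Section 4.1 can also be gathered in fewer steps. The sketch below uses only the built-in iris data and standard functions (summary, fivenum, IQR); the object names sl, s, five, and spread are chosen here for illustration.

```r
sl <- iris$Sepal.Length        # names are case-sensitive: the capital letters matter
s <- summary(sl)               # min, quartiles, median, mean, and max in one call
five <- fivenum(sl)            # five-number summary (min, hinges, median, max)
spread <- IQR(sl)              # interquartile range: 75% quantile minus 25% quantile
s; five; spread
is.null(iris$sepal.length)     # TRUE: no column has that all-lowercase name
```

The last line is a quick check worth remembering: a misspelled or wrongly capitalized variable name gives NULL rather than an error, and summary functions applied to NULL then give confusing results.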

[Figure: Histogram of iris$Sepal.Length (x-axis: iris$Sepal.Length; y-axis: Frequency)]

Many functions in R have optional arguments that change the way that the function acts. Often we can omit these arguments since R chooses reasonable default values. Note that R produces frequency histograms by default. To produce a density histogram, we need the optional argument freq. Note that we name the argument. Optional arguments usually have to be named so that R knows which arguments are being included. Other optional arguments control the title of the histogram and the axis labels.

> hist(iris$Sepal.Length, freq=F, main="Sepal Length", xlab=" ")    # F is short for FALSE

[Figure: density histogram titled "Sepal Length" (y-axis: Density)]
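To see what the frequency/density distinction means numerically, the sketch below asks hist to compute the bins without drawing them (the named optional argument plot = FALSE) and checks the two scales. The bin boundaries passed to breaks and the names h and widths are choices made for this illustration, not defaults.

```r
# Named optional arguments: breaks sets the bin edges, plot = FALSE suppresses drawing.
h <- hist(iris$Sepal.Length, breaks = seq(4, 8, by = 0.5), plot = FALSE)

widths <- diff(h$breaks)       # width of each bin (all 0.5 here)
sum(h$counts)                  # frequency scale: bar heights add up to the 150 cases
sum(h$density * widths)        # density scale: bar areas add up to 1
```

Dropping plot = FALSE draws the histogram, and adding freq = FALSE to that call switches the vertical axis from the counts to the densities computed here.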

Another common plot is called a boxplot. A boxplot is a graphical representation of a five-number summary of a quantitative variable. The default boxplot uses a vertical scale. Here we draw a horizontal boxplot.

> boxplot(iris$Sepal.Length, horizontal=T, main="Sepal Length")    # T is short for TRUE

[Figure: horizontal boxplot titled "Sepal Length"]

5 Importing data

In this class, we will use data from several different sources. R has many built-in datasets. (The iris dataset used earlier in these notes is one of those.) There are also many packages available that provide additional datasets and also extend R by defining useful functions. Packages are installed and loaded via the Packages tab of the Files pane of RStudio.

We will also use datasets developed especially for this class. Each RStudio user has space to save files. You can see your personal directory using the Files tab of the same window in which you look at plots. Each user has a Public directory which is visible to other RStudio users. There are two collections of data that are available through the instructor's public directory. The directory Navidi contains the datasets from the textbook. Other datasets used in this course are also included there.

To load such datasets, use the Import Dataset tab of the Workspace pane, select From Text File and enter as filename /home/stob/data. You will see a file listing for that directory. Class datasets are in this directory and textbook datasets are in the Navidi directory. For example, to import the dimes dataset, simply select dimes.csv. A window will pop up that enables you to tell R which format the data is in, but in this case RStudio understands the CSV format that the dimes dataset is in. This procedure defines the data frame dimes.

To load a textbook dataset, navigate to the Navidi folder and select the appropriate chapter and then the file in that chapter: for example, ex3-2-5.txt is the data for exercise 5 in section 2 of chapter 3. Note that you will have to change the variable name (from ex3-2-5) since dashes are not acceptable characters in variable names. Choose a short, memorable variable name!

6 Useful features of RStudio

One of the most useful features of RStudio is that it will save the state of your session even if you close your browser. This includes all variables, plots, and other settings. This is very useful for class work since you might get stuck on homework after attempting a problem and can pick up again after you get help in class or from the instructor.

Another useful feature is the History tab of the upper right-hand window. In that window you can find all the lines that you have entered into the console. These lines can be copied into the console window, for example.

The Source pane can be used to edit and save your work. You can run command lines entered into this pane using the appropriate buttons. If you have an error, you can simply edit that line in the Source pane.

Two useful editing features are accessed by the tab key and the arrow keys. If you start to type the name of a function (e.g., > hi) and press the tab key, you get all the possible functions that begin with these characters along with a short description of what they do. (Try pressing the tab key after entering hi.) If you hit the up-arrow key, the previous line that you entered becomes the current line and you can edit it and enter it again. This is very useful if you make a small typo on a very long line.
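Returning to the import step described under Importing data: the same result can be obtained from the console with read.csv. The sketch below is a stand-in built on assumptions. It writes a small made-up CSV to a temporary file instead of reading the class files, and the columns mass and year and the name coins are invented (they are not the real contents of dimes.csv). It also shows how make.names turns a name such as ex3-2-5 into a valid variable name.

```r
# Stand-in for a class file: a temporary CSV with invented columns and values.
path <- tempfile(fileext = ".csv")
writeLines(c("mass,year", "2.26,1995", "2.27,2001", "2.25,1988"), path)

coins <- read.csv(path)        # defines a data frame from the CSV file
str(coins)                     # 3 obs. of 2 variables

# Dashes are not allowed in variable names; make.names() replaces them with dots.
valid <- make.names("ex3-2-5")
valid                          # "ex3.2.5"

unlink(path)                   # remove the temporary file
```

For a real class dataset, path would be replaced by the location of the file chosen in the Import Dataset dialog.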

