Research Methods Group
AN INTRODUCTION TO R SOFTWARE
DATE: 20th MAY, 2008
Crispin Matere & Sonal Nagda
May 2008, Research Methods Group, ILRI/ICRAF An Introduction to R 2
CONTENTS
An overview of the R package
1. R Installation
2. Loading and installation of R packages
   2.1. Updating R and the packages
3. Running the Software
   3.1. Changing your workspace
4. Getting help in R and also online help
5. How R works
   5.1. Creating, listing and deleting objects
        Calculating with numbers
        Calculating with arrays
        Assigning variables
        Combining arrays
        Using functions
        Creating a basic plot
   5.2. Printing from R
6. Working with data in R
   6.1. R scripts
7. Importing data and data files
   7.1. Importing a text data file
   7.2. Accessing built in R datasets
   7.3. Accessing data using the RODBC package
        Reading Access database files with RODBC
        Excel sheets with RODBC
8. The R Graphical User Interface (RGUI) [The R menus]
9. The R Manuals and other documents
        Documents with more than 100 pages
        Documents with fewer than 100 pages
        Short Documents and Reference Cards
        Task Views
        R News
   9.1. Mailing lists
10. Useful websites and references
Appendix 1: Example on use of R
Appendix 2: Introduction to mapping with R - By M. Lesnoff
Appendix 3: Few tips on R configuration - By M. Lesnoff
An Overview of the R package

R is an object-oriented scripting language that combines the programming language S developed by John Chambers (Chambers and Hastie 1988, Chambers 1998, Venables and Ripley 2000) with a large set of modern as well as classical techniques for statistical data analysis and modeling, a set of functions for graphical visualization of data and model output, extensive help facilities and a user interface with a few basic menus. As an environment of choice for data analysis and novel research techniques, R is rapidly growing in popularity, not only in research but also in the academic world. It has a wide range of support resources. It is continuously growing and, above all, it is available as Free Software (see http://www.gnu.org/ and http://www.r-project.org/copying). It compiles and runs on a wide variety of computing platforms including Windows and MacOS. With the increasing cost of commercial statistical software, R is therefore a sound choice.

R users can also program their own code. R is interactive and so gives the user immediate feedback. It is highly versatile and well suited to statistical simulations such as Monte Carlo, jackknifing and bootstrapping. Since R is command driven, save for a limited menu system, an analysis can be reproduced exactly by properly documenting the commands used in it. Finally, the R source code is available for both inspection and modification where need be.

However, in order to realize the full benefits of programming in R, some knowledge of matrix manipulation is required. Furthermore, only a few tasks can be accomplished through the menu system, so it requires some effort to learn to use R.

1. R Installation
You want to install R on the hard disk of your PC. This is much simpler if your PC has access to the internet. The first step is to go to the Comprehensive R Archive Network (CRAN). It is a collection of sites which carry identical material, consisting of the R distribution(s), the contributed extensions, documentation for R, and binaries. Type its full address, http://cran.r-project.org/, or just type CRAN into Google and you will be taken to the site.
Under CRAN, select Mirrors -> South Africa (http://cran.za.r-project.org)
Under Download and Install R, click Windows and then base -> R-2.7.0-win32.exe (this will change depending on the latest version)
If you select Run, wait for the program to be installed on your computer.
If you have no access to the internet, the best way is to ask someone who has access to download the setup program and save it onto a CD or a memory stick for you. You can then run the setup program on your PC/laptop and the R software will be installed.
A more detailed set of instructions about installation of R for Windows can be obtained at

2. Loading and installation of R packages
Every function in R is in a package, and packages come with documentation. In addition to the packages included in the initial installation, a large number of useful contributed packages are available for R. There are 1337 packages in the R repository as of 28th March, 2008. You can view the packages installed on your computer by entering the command library(). Type
> library()
To check the functions available in a certain package, type, for example,
> library(help="stats")
which will open a help window containing one-line descriptions of all functions in the stats package.
When you want to do some specific task, for example analysis of dose-response curves, you will need to install the package that does the intended task. In this case you install the drc package. The easiest way to install an additional package (e.g. drc) is through the Packages menu of the R graphical user interface (RGUI): from the menu select Packages -> Install package(s). A window opens asking you to pick a CRAN mirror site; choose the one nearest you (in our case, probably South Africa). After this, a window will open with the various packages that are available on CRAN. Select the package that you want to install, click the OK button and the package will be installed (see Figures below).
Figure: Select Packages -> Install package(s)
Figure: Selecting CRAN Mirror
An alternative way is to first download the binary version of the package (it will be something like *.zip) and save it on a hard disk or a CD-ROM. Then, again through the Packages menu of the RGUI, select Install package(s) from local zip files... Go to the location where you saved the package, select it and click the Open button, and the package will be installed (see Figures below).
Figure: Select Install package(s) from local zip files
Figure: Installing the drc package
Sometimes it might be difficult to access the CRAN mirror. In that case you could go directly to http://cran.r-project.org/ and click on Packages. Select the package and download the Windows binary version (.zip) file.
It is not enough to just install the package. You have to load it into memory to make it available for use. You accomplish this by entering the command library(nameofpackage) at the R console, where nameofpackage is replaced with drc in our case. Type in
> library(drc)
or from the menu bar select Packages -> Load package -> drc. Below is the figure.
Figure: Loading the drc package
Notice that there is an error message (highlighted). The reason is that drc requires some other packages to have been installed for it to run. Here, plotrix is one of them, in addition to lattice, MASS and nlme. You will therefore need to install plotrix, using either of the two installation procedures covered above. In general, a good number of packages will not work on their own and require other packages to have been installed. This will be clear from the messages shown whenever you attempt to load a package, so watch out for them. To see the packages currently loaded into memory, type in
> search()

2.1 Updating R and the packages
Since R and its packages are continuously being improved and new ones added, it is a good idea to upgrade to the new versions from time to time. When R and the packages are installed directly from CRAN, they are of the latest version, because CRAN is constantly being updated. Update all installed packages periodically through the Packages menu of the RGUI, by selecting Update packages from CRAN and following the same steps as for the installation above. Only packages for which a newer version exists than the one currently installed on your computer will be downloaded and installed. Since some packages are bundled with the R distribution (recommended packages), it is a good idea to do a new installation of R once in a while in order to update these. To check for new releases go to
http://www.ats.ucla.edu/stat/r/icu/updating_win.htm
http://cran.r-project.org/ - FAQs -> R Windows FAQ -> section 2.1
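The menu steps above also have command-line equivalents. The following is a minimal sketch, assuming an internet connection is available whenever a package really has to be downloaded; the helper name ensure_package and the repository address passed to repos are illustrative choices, not something prescribed by this guide.

```r
# Hypothetical helper (not part of R): install a package only if it is
# missing, then load it. dependencies = TRUE asks install.packages() to
# fetch the packages it depends on (and those it suggests) as well,
# which avoids the kind of error seen above when plotrix was missing.
ensure_package <- function(pkg, repos = "https://cran.r-project.org") {
  if (!(pkg %in% rownames(installed.packages()))) {
    install.packages(pkg, repos = repos, dependencies = TRUE)
  }
  # character.only = TRUE lets library() take the name as a string
  library(pkg, character.only = TRUE)
  invisible(pkg %in% .packages())
}

# "stats" ships with R, so this call needs no download
ensure_package("stats")
search()   # packages currently attached

# To refresh all installed packages from CRAN (needs internet access):
# update.packages(ask = FALSE)
```

Calling ensure_package("drc") would follow the same path as the menu route: download if necessary, then load.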
3. Running the Software
To run R click on Start -> All programs -> R-2.7.0. The R icon was created by the setup program during installation. In the event that there is no icon, use Explorer to go to C:\Program Files\R\R-xxx\bin, look for the R icon named Rgui and double click it. The program will be started. You can also make a copy of the R icon and place it on the desktop.
What appears when R is started is called the R Graphical User Interface (RGUI). R issues a prompt where it expects the user to issue a command. The default prompt is > (see Figure below). This is where you type commands that instruct R to do some task.
Figure: The R Graphical User Interface (RGUI)
The first thing that appears is the version number of R and the date of the version. It is recommended that you visit the CRAN site regularly to ensure that you have the latest version of R. If you have an old version on your PC, it is best to uninstall it before downloading the latest one. The header also explains that R is free software and comes with ABSOLUTELY NO WARRANTY, and tells you how to see the list of current contributors. Typing in
> license()
> contributors()
> citation()
> demo()
you obtain the figures below.
Figure: Typing license() at R prompt
Figure: Typing contributors() at R prompt
Figure: Citing the R software
Figure: Typing demo() at R prompt
Type in the following and see what happens
> demo()
> help()
> help.start()
The command citation() shows how to cite the R software in your written work. The R Development Core Team has done a huge amount of work and we, the R users, should give them due credit whenever we publish work that has used R. If you want to exit before doing anything, type
> q()
or, from the menu bar, select File -> Exit.

3.1 Changing your workspace
The first thing you probably want to check is your working directory. You can find this from R by issuing the command getwd() at the R prompt
> getwd()
To change the directory type in
> setwd("c://rcourse")
(this will change the working directory to rcourse on drive c) or from the menu click on File -> Change dir
Figure: Changing the working directory
Figure: Choosing the desired directory
It is always a good idea to set the current or working directory before starting any serious work with R. Make a new directory for each project that you do with R and store your data and the related R programs or code in that directory. Use the Microsoft Windows file explorer in the normal way to create your directory.
It is also possible to create an icon that starts R with some desired working directory. You can achieve this as follows. Start R, then change the current directory to the desired directory (e.g. C:\Forests). Exit R (by issuing the command q() or using the menu) and answer Yes when prompted Save workspace image? When this is done, you should have a file with the name .RData in the desired directory. Double clicking the .RData icon will start R, and the current directory of the RGUI will be set to the desired directory, i.e. the directory in which .RData resides.
The best way is to instruct R to choose the required working directory at startup so that you don't have to change the working directory every time. Look for the shortcut to R-2.6.2\bin\Rgui.exe on the desktop, in the Start menu file tree or somewhere else on your computer. Right-click the shortcut, select Properties... and change the `Start in' field to your desired directory. See figure below.
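The directory commands from this section can also be tried together at the prompt. The sketch below is illustrative only: it uses a folder inside R's temporary directory as a stand-in for a real project folder such as c://rcourse, and the names old_dir and project_dir are hypothetical.

```r
# Remember where we started so the session can be put back afterwards
old_dir <- getwd()

# Hypothetical project folder; tempdir() stands in for a real location
project_dir <- file.path(tempdir(), "rcourse")
if (!file.exists(project_dir)) dir.create(project_dir)

setwd(project_dir)   # make it the working directory
getwd()              # confirm the change
list.files()         # files R can now see without a full path

setwd(old_dir)       # return to the original directory
```

Keeping the old directory in a variable makes it easy to switch between projects within one session.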
Figure: Selecting the startup directory (the Start in field is changed to C:\Desired directory)

4. Getting help in R and also online help
Online documentation for most of the functions and variables in R exists and can be printed on screen. The easiest way of getting help in R is to click on the Help button on the toolbar of the R graphical user interface (RGUI).
R comes with several official manuals. The following manuals for R are downloadable as PDF files or can be directly browsed as HTML: An Introduction to R, The R language definition, Writing R Extensions, R Data Import/Export, R Installation and Administration, R Internals, and The R Reference Index and FAQs. These should be your primary source of information. Note also that the main location for general questions about R is the R-Help mailing list, which we will discuss later (but please make sure to read the posting guide before posting a question there!).
To display and access the manuals and other information in HTML format, issue the command help.start() as follows
> help.start()
and you obtain the screen below. Double click on any of the items to display further information. You will be surprised by the amount of help information available. If you need to search for something specific, select Search Engine & Keywords and then follow the instructions.
Another way of accessing the same page is through the menu Help -> Html help. If you are connected to the internet, you can type CRAN in Google and search for the help you want at CRAN.
If you know the name of the function or subject that you seek help for, just type a question mark (?) at the R prompt followed by that name. For example, to get help on write.table type
> ?write.table
and obtain the screen below;
Figure: Obtaining help for write.table
At other times, you only know the subject for which you want help and not the function (e.g. anova). In this case use the help.search function and enclose your query in double quotes as follows
> help.search("anova")
or from the menu bar select Help -> Search help (enter anova in the Question dialogue box) and obtain the screen below
Figure: Help information on anova (help.search("anova"))
Other useful functions for help are find and apropos. The find function tells you which package contains the object or item you are looking for. For example, if you want to find plot, you type
> find("plot")
(remember the double quotes around plot). When you use the find function, the item that you are looking for must be a function in one of the installed packages and also unique. The apropos function, on the other hand, returns the names of all objects in the search list that match your (potentially partial) enquiry. For example, to find all items with table in their name, you type
> apropos("table")
or use the menu Help -> Apropos (enter table in the Question dialogue box).
Results of the find("plot") and apropos("table") commands are shown in the figure below.
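The lookup functions described so far can be compared side by side. This is a small sketch rather than a complete tour of the help system; the calls that open a help window are left as comments so that the lines can also be run from a script, and the variable names where_is_sum and table_like are illustrative.

```r
# Exact name known: open its help page (same as ?write.table)
# help("write.table")

# Only a subject known: search the installed help pages for it
# help.search("anova")

# Which attached package(s) contain an object with exactly this name?
where_is_sum <- find("sum")
where_is_sum

# Which visible objects have "table" somewhere in their name?
table_like <- apropos("table")
head(table_like)
```

find() needs the exact name, whereas apropos() accepts a fragment, so apropos() is usually the better first step when you only half remember a function name.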
Figure: Results of find("plot") and apropos("table")
If you know the name of a package and you need further information concerning the package, type help(nameofpackage). For example, suppose you know there is a package called stats and you want more information about it. At the R console type
> help(stats)
(and select Index to get further details on the package) (see Figures below).
Figure: Help on stats package
Figure: Further help on stats package

Summary on the help system
> help.start()          brings up a web browser pointing at all of the help files and R manuals
> help() or > ?help     searches through functions in the already loaded packages
> apropos()             looks through all accessible R objects, meaning it will match names of functions containing a given string
> help.search("surv")   finds all items that have the entry surv, so you may end up with very many entries; to avoid this, be more specific with your search
> help(package=drc)     gives information on the package named drc
> example(glm)          runs the glm example
> demo()                lists all available demos
> demo(lm.glm)          runs a demonstration of some linear and generalized linear modeling examples

5. How R works
R works with objects and stores these in active memory. Objects have names. Names start with letters (A-Z or a-z) and are case sensitive; for example, Wgt and wgt are different objects and so are Y and y. Named objects are used to represent variables, data, functions, results and so on. You act on objects via operators. Operators include arithmetic, logical and comparison operators, as well as functions.
You work with R by issuing commands. Commands can be typed directly at the R console, immediately after the prompt, which usually appears as >. However, you can also write the commands in a text editor and then send them to R for execution. From the menu click File -> New Script (or Ctrl+N). Type the command library() in the R editor window, click the right mouse button and select Run line or selection (Ctrl+R), and the command is executed.
Figure: Opening a text editor
Figure: Issuing a command via a text editor
One advantage of working via a text editor is that you can save your commands in a file for future reference.
Another way of working with R, although it offers only a limited number of operations, is via the R menus on the R Graphical User Interface (RGUI). More on the text editor will be covered when we look at writing R scripts later.
The values of arithmetic expressions are displayed, but you see nothing if the command is an assignment statement. For example, observe what happens with the following commands
> 2+4
> a<-3
Below is a representation of how R works.
Objects (input=data) -> Operators (arithmetic, logical, comparison, functions) -> Objects (results=data)

5.1 Creating, listing and deleting objects
We have already mentioned that R works on objects through operators. It is thus necessary that we know how to create, view (list) and delete them. To create an object (new variable), we use the symbol <- or =. It is called an assignment operator. At the R prompt, type
> x<-2
or
> 2->x
or
> x=2
Here, we have assigned 2 to x, so x is our object and represents the number 2. Try the following also
> 12->z
> y<-5+3        (or y=5+3)
> Total<-y+x    (sum of y and x)
To view or list x, z, y and Total we just type each of the objects at the R prompt.
> x
> z
> y
> Total
We obtain the following results on the R console window (see Figure below)
Figure: How R works (results of calculations etc)
Up to this point, we have created four objects, namely x, z, y and Total. To list the existing objects, issue the ls() command at the prompt
> ls()             # list existing objects
To remove the y object, issue the command rm(y)
> rm(y)            # remove the object named y
To check that y has indeed been removed, repeat the ls() command
> ls()             # list remaining objects
To remove all objects, issue the command rm(list=ls())
> rm(list=ls())    # remove all objects
and finally, to check that all have been removed, issue the command ls() again
> ls()             # list the remaining objects
Results displayed on the R screen will look as below.
Figure: How R works continued
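For readers who prefer to see the whole cycle in one place, the commands of this section can be collected as follows. This is only a recap sketch of the steps above; the name remaining is introduced here for illustration, and rm(list=ls()) is left as a comment because it would clear everything in the workspace.

```r
x <- 2            # assign with <-
12 -> z           # the arrow can also point to the right
y = 5 + 3         # = also assigns at the prompt
Total <- y + x    # names are case sensitive: Total is not total

ls()              # names of the objects created so far

rm(y)             # remove a single object
"y" %in% ls()     # FALSE once y has been removed

remaining <- ls() # keep the listing in an object of its own
remaining

# rm(list = ls()) # would remove every object in the workspace
```

Note that an assignment prints nothing; type the object's name, as in the last line but one, to see its value.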
Practice creating objects and carrying out operations on them. You will need to perform these operations a couple of times in order to become familiar with how R works.
R can also be used as a calculator as well as a spreadsheet. The following commands give a glimpse of what R can do. You will find it most beneficial to type these commands into R and see the results yourself.

Calculating with numbers
> 12 * 3-12/2 + sqrt(4)
[1] 32
> 4^2
[1] 16
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> sum(1:10)
[1] 55
> mean(1:10)
[1] 5.5
> sd(1:10)
[1] 3.027650
You can use R like a calculator. The * symbol stands for multiplication and the ^ symbol stands for exponentiation. The colon operator : creates an array of numbers from the first to the second. R has a number of built-in functions such as mean and sum that have obvious meanings. The symbol [1] that precedes the output says that the first row begins with the first number of the output. Long answers are broken across rows, as in this example, with similarly useful row labels.
> 1:100
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100

Calculating with arrays
R can do arithmetic operations on arrays. If you multiply an array of numbers by a single number, the multiplication happens separately for each number. You can also add or multiply equal-sized arrays of numbers.
> 2 * (1:10)
[1] 2 4 6 8 10 12 14 16 18 20
> (1:10) + (10:1)
[1] 11 11 11 11 11 11 11 11 11 11
> (1:4)^2
[1] 1 4 9 16

Assigning variables
You can use the = or <- sign to create new variables. Older documentation may use <- instead of the = sign, but both are valid methods. Typing the name of a variable displays it.
> a = 1:10
> a
[1] 1 2 3 4 5 6 7 8 9 10
> mean(a)
[1] 5.5

Combining arrays
The c function in R concatenates things. (Because c is the name of a built-in function in R, it is preferable not to use c as the name of a variable, or else you or R may get confused. Use cc when you really want to use c.)
> a = 1:5
> b = 10:15
> cc = c(b, a, b)
> cc
[1] 10 11 12 13 14 15 1 2 3 4 5 10 11 12 13 14 15

Using functions
R has many built-in functions and you can write your own if you want. To see a function, just type its name. If you actually want to use the function, you need to add parentheses at the end, possibly with arguments in between.
> mean
function (x,...)
UseMethod("mean")
<environment: namespace:base>
> mean(cc)
[1] 9.705882
Here is code to create a function that computes the area of a rectangle.
> findarea = function(x, y) { return(x * y) }
> findarea(20, 4)
[1] 80
Sometimes you may see a + prompt. That is how R lets you know that the previous command was incomplete. This can happen when you have a command that begins with a ( or a { and you press return before the command ends. You can often rectify this by typing the missing } or ). If you have mistyped and can't get out of the + prompt, pressing the Esc key usually works to get back to a regular prompt.

Creating a basic plot
> x=(0:100)/10 # from 0 to 10, increment of 0.1
> plot(x,x^3-13*x^2+39*x)
Plotting curves instead of points
> plot(x,x^3-13*x^2+39*x,type='l')
Labelling axes
> plot(x,x^3-13*x^2+39*x,type='l', xlab="time",ylab="intensity")
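The function and plotting examples above can be combined. The sketch below is one possible arrangement, not the only one: it gives the plotted expression a name (cubic, a hypothetical choice) and writes the plot to a temporary PDF file so that it also runs where no graphics window is available.

```r
# A user-defined function, written across several lines
cubic <- function(t) {
  t^3 - 13 * t^2 + 39 * t
}

x <- (0:100) / 10   # 101 values from 0 to 10 in steps of 0.1
y <- cubic(x)       # the function is applied to every element of x

# Send the plot to a temporary PDF file; at the console, leave out
# pdf() and dev.off() to draw in a graphics window instead
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)
plot(x, y, type = "l", xlab = "time", ylab = "intensity")
invisible(dev.off())   # close the file so that it is written to disk
```

Wrapping the expression in a function means the formula is typed once and can be reused for other values of x.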
5.2 Printing from R
Results of commands issued at the R console window are displayed in the same window. If you need a hard copy, you have to print them. Select File -> Print. Alternatively, click on the printer icon.
Figure: Printing from R
For graphs and plots, activate the window that contains what you need printed, click the right mouse button and then select print. In all the above cases, you can print to a file by selecting the check box print to file and responding to the questions.

6. Working with data in R
Perhaps the most important feature of any statistical software is the ability to handle data: entering raw data or reading datasets from other sources. Data entry in R is achieved by first creating data vectors and lists, then converting them into a data frame (a dataset with rows and columns). A set of elements defines a data vector. When all the elements are real numbers we have a numeric data vector, and when they are text we talk of a character data vector. Comments can be included in the statements by prefacing them with a hash (#) sign.
To create a numeric vector use the c() function as follows
> weight<-c(2.3, 1.9, 2.6, 3.1, 4.7, 10)
and to create a character vector we type
> month<-c("January", "February", "March", "April", "May", "June")
We usually measure or observe more than one attribute of a study unit, also called a sampling unit. Therefore we organize data sets into collections of variables (vectors). In R, such collections are called data frames. You generate a data frame by combining variables, whereby each variable becomes a separate column. In order for a data frame to represent the data properly, the sequence in which observations appear in the vectors (variables) must be the same for each vector, and each vector should have the same number of observations. For example, the first observations from each of the vectors to be included in the data frame must represent observations collected from the same sampling unit.
To demonstrate the use of data frames in R, we have available data on height and diameter of 14 Prunus africana trees in Mabira forest, Uganda.

TreeNo   Height   Diameter
1        30.7     66
2        36.4     147
3        35.1     126
4        20.6     56
5        31.7     93
6        31.7     99
7        37.1     104
8        34.8     103
9        25.9     32
10       27.3     44
11       28       67
12       30.6     56
13       22.3     35
14       14.4     26

First we create the variables (vectors) TreeNo, Height and Diameter respectively as follows
> TreeNo<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
> Height<-c(30.7,36.4,35.1,20.6,31.7,31.7,37.1,34.8,25.9,27.3,28,30.6,22.3,14.4)
> Diameter<-c(66,147,126,56,93,99,104,103,32,44,67,56,35,26)
Next, we combine the three variables into a single data set (data frame) called prunus
> prunus<-data.frame(TreeNo,Height,Diameter)
The creation of the prunus data set (data frame) can be done in a single step by typing
> prunus<-data.frame(TreeNo=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14), Height=c(30.7,36.4,35.1,20.6,31.7,31.7,37.1,34.8,25.9,27.3,28,30.6,22.3,14.4), Diameter=c(66,147,126,56,93,99,104,103,32,44,67,56,35,26))
To print the contents of this data set to the screen, type
> prunus
Sometimes it is necessary to edit the data, including adding another column of data representing another variable. To do this, use the fix() function. The fix function works only once a data frame has been created. For tree number 14, we shall change its diameter value from 26 to 29. Also, tree numbers 1, 2, 6 and 11 are indigenous and the remaining ones exotic. We shall create a variable called variety to store this information. Type in
> fix(prunus)
and obtain the R Data Editor window, in which you make the required changes.
Figure: The R Data editor
Figure: Creating the variety variable
Figure: The new data after editing
Figure: Obtaining help for data editor
You can use the data.entry() function for entering and editing variables that already exist, but not for creating new variables. Type in
> data.entry(prunus)
Once done, close the R data editor window and the changes are automatically saved. To avoid confusion, we remove the single vectors TreeNo, Height and Diameter once a data frame has been created. Use the rm(vectorname) command as follows
> rm(TreeNo)
> rm(Height)
> rm(Diameter)
To access a variable from within a data frame, we use the $ sign. For example, to print the contents of the Diameter vector in the prunus data set, we type
> prunus$Diameter
One can also recode data. Let us say those trees with TreeNo less than 9 are Exotic and those with TreeNo greater than or equal to 9 are Indigenous. We use the ifelse function. Type in
> prunus$variety1<-ifelse(prunus$TreeNo<9, "Exotic", "Indigenous")
> prunus

6.1 R scripts
Previously we looked at the R Editor. We use the R editor to create R scripts. R scripts are lists of instructions (commands) that R can perform. The advantages of scripts are that
1. A sequence of tasks such as data entry, analysis and graphical preparation can be repeated quickly and precisely
2. The sequence of steps used to complete a task, say a data analysis, is documented permanently
3. Many similar analyses can be simplified
4. Sharing of data, analyses and techniques is simplified
5. You can document your programs
To create a script file, from the menu select File -> New script. You obtain a window where you type your instructions (commands) and from which you can also submit them for execution. Once finished, save your script file in a place where you can access it for future editing or additions. The steps for creating the prunus data can be saved in a script file (see Figure below)
Figure: Opening a new script file
Figure: Writing and documenting the script
To save the script file, make sure the script window is active and then from the menu click File -> Save as (and provide the name; it will be saved with the extension .r). You can use any text editor to create your script files. Just remember to save them with the extension .r
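The steps for building the prunus data could be collected in a script along the following lines. This is a sketch rather than the script shown in the figures; it uses the tree measurements from Section 6 as originally tabulated (before any editing with fix()), and the file name prunus.r mentioned afterwards is hypothetical.

```r
# Build and recode the Prunus africana data set

# Height and diameter of the 14 trees tabulated in Section 6
TreeNo   <- 1:14
Height   <- c(30.7, 36.4, 35.1, 20.6, 31.7, 31.7, 37.1, 34.8,
              25.9, 27.3, 28, 30.6, 22.3, 14.4)
Diameter <- c(66, 147, 126, 56, 93, 99, 104, 103,
              32, 44, 67, 56, 35, 26)

# Combine the vectors into a data frame, one column per variable
prunus <- data.frame(TreeNo, Height, Diameter)

# The single vectors are no longer needed
rm(TreeNo, Height, Diameter)

# Recode: trees numbered below 9 are Exotic, the rest Indigenous
prunus$variety1 <- ifelse(prunus$TreeNo < 9, "Exotic", "Indigenous")

prunus                  # print the data frame
mean(prunus$Diameter)   # a column is accessed with $
```

If these lines were saved as, say, prunus.r in the working directory, the whole script could be run in one go with source("prunus.r").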
7. Importing data and data files
In the previous section we demonstrated that it is possible to generate a dataset from scratch. However, data sets are better managed with spreadsheets such as Excel or databases such as Access. It is therefore necessary to both import and export data when working with R.

7.1 Importing a text data file
First, recall that R uses the working directory (or the startup directory). Your text data file must be located in the working directory for it to be accessible by name alone. Find out what your working directory is and change it if necessary, using the getwd() and setwd() commands. If the file you want to read is not in the working directory, you will need to specify its complete path.
Text data files are also said to be in ASCII format. Most software accepts and produces data files in ASCII format, so it is the most universal format. You can view and edit text files with most (all?) text editors. To read a text file, use the command read.table. The read.table function creates a data frame, with cases (lines) and fields (variables).
Suppose we have a text data file named timetodeath.txt located in the c:\rcourse\ directory. Use Notepad to view the data. Note that the file has variable headings and that data points are separated by spaces. To read and view it in R, type
> timetodeath<-read.table(file="c://rcourse/timetodeath.txt", header = TRUE, sep="")
> timetodeath # List the data for visualization
The data will appear as a spreadsheet with rows as cases and columns as variables. The columns will have names equal to those in the header of the text data file. The number of cases will equal the total number of lines excluding the header.
Next, we read a comma separated values (csv) file. You can create a csv file in Excel. Suppose we have a csv file named timetodeath.csv in the same directory as before. Then to read it into R, issue the command
> mydata<-read.csv(file="c://rcourse/timetodeath.csv", header = TRUE)
> mydata # List the data for visualization
Next, we look at writing (saving) these data sets back to text files under different names, so that we can be sure that all went well. Remember that both data sets are in memory. To achieve this, we shall use the write.table function as follows (copying timetodeath to ttdeath and mydata to mdata)
> write.table(timetodeath, file="c://rcourse//ttdeath", quote=FALSE)
> write.table(mydata, file="c://rcourse//mdata", quote=FALSE)
or for a csv file type
> write.csv(mydata, file="c://rcourse//csvmdata")
We can now navigate to the directory c:\rcourse and open these files. They will all be text, so we can use Notepad to view them.

7.2 Accessing built in R datasets
A large number of datasets are supplied with R (stored in R packages), and to access a dataset, the package that stores it must be loaded first. Issuing the command library() shows you which packages are installed at your site. There is a package called datasets. This consists of R datasets and comes with an R installation. To see the list of datasets currently available (actually in the datasets package), you type
> data()
and obtain a list of 101 datasets (as of 3rd April, 2008). To load the dataset trees (Girth, Height and Volume for Black Cherry Trees), which we know exists in datasets, type
> data(trees)
> trees
To check the datasets in the packages type
> data(package = .packages(all.available = TRUE)) # (to list the data sets in all *available* packages)
To access a dataset in a particular package, install and load the package first (if it is not already loaded) and then type data(filename, package="packagename"), e.g.
> data(acacia, package="ade4") # (the dataset acacia from the package ade4)
> acacia
If you want to export your R dataset into Excel for viewing or editing, use the write.table or write.csv commands discussed above. Now examine the contents of your saved data using Notepad (or any other editor).
It is a good idea to save your workspace so that when you resume next time, you start where you left off and are also able to access the datasets that you have been working on. When you exit R, you will be prompted to save the workspace. Respond by clicking Yes.

7.3 Accessing data using the RODBC package
Open DataBase Connectivity (ODBC) is a set of standardised functions for transferring data between SQL databases and applications. To use it one needs a driver manager, provided in the odbc library, and a driver provided for/by the specific database engine.
One of the R packages, named RODBC, implements R functions for moving data between data frames and SQL tables. First, install and load the RODBC package by following the usual steps, as was done earlier. Change your working directory to the one where your data is stored.

Reading Access database files with RODBC

Ensure that an Access database file exists and note the name of the table that you want to read into R. For example, suppose we have an Access database named Prunus africana in c:/rcourse and that the data table is named prunus. To read this data into R, type the commands shown below in succession:

> prunus <- "c://rcourse//prunus africana.mdb"
> new <- odbcConnectAccess(prunus)
> prunus2 <- sqlFetch(channel = new, sqtable = "prunus")
> prunus2

We may also want to write a data table into a database (in our case, into the Prunus africana database). First, read the table, edit/modify it and then put it into the database as a new table altogether. This is how we do it:

> prunus <- 'c://rcourse//prunus africana.mdb'                       # reference the database
> new <- odbcConnectAccess(prunus)                                   # establish a channel
> prunus2 <- sqlFetch(channel = new, sqtable = "prunus")
> fix(prunus2)                                                       # recall the command for modifying/editing data frames
> sqlSave(channel = new, dat = prunus2, tablename = 'logdiameter')   # save the table as logdiameter

Excel sheets with RODBC

Similarly, ensure that an Excel data file exists. Then enter the following commands in succession at the R prompt:

> prunus <- "c://rcourse//prunus africana.xls"
> new2 <- odbcConnectExcel(prunus)
> outprunus <- sqlFetch(channel = new2, sqtable = "prunus")
> outprunus

In the Excel case too, we can save a new worksheet using the sqlSave command. To read other file types, check the corresponding ODBC connection functions in the RODBC package and proceed along the same lines as above. In conclusion, a wide range of database connectivity functions exists in the RODBC package. Visit the package help manual to learn more.
It is recommended that you consult the R Data Import/Export manual for detailed information about data import and export when using the R system. Finally, there is another package on CRAN, called foreign, which provides functions for importing and exporting data from and to other software.
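A quick way to convince yourself that an export worked is to read the file straight back in and compare it with the original. The lines below are a minimal sketch of such a round trip using the built-in trees dataset; the temporary file created with tempfile() stands in for a path such as c:/rcourse/..., and the names tmp and trees2 are only illustrative.

```r
# Write a built-in dataset to a temporary csv file and read it back
tmp <- tempfile(fileext = ".csv")                 # stand-in for a file in c:/rcourse
write.csv(trees, file = tmp, row.names = FALSE)   # export without the row names
trees2 <- read.csv(tmp, header = TRUE)            # import again

names(trees2)                                     # same variable names as trees
all.equal(trees$Girth, trees2$Girth)              # compare one column with the original
unlink(tmp)                                       # remove the temporary file
```

Setting row.names = FALSE keeps write.csv from adding an extra column of row labels, which would otherwise come back as an additional variable on import.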
8. The R Graphical User Interface (RGUI) [The R menus]

When you have started R and it is ready to receive commands, you can perform various actions via the menus instead of issuing commands at the R prompt (the console). This is achieved through a set of menus and sub-menus on the R Graphical User Interface (RGUI). Below, we review each of these menus.

File menu: Source R code; New script; Open script; Display file(s); Load Workspace; Load History; Save History; Change dir; Print; Save to File; Exit

Edit menu: Copy; Paste; Paste commands only; Copy and Paste; Select all; Clear console; Data editor; GUI preferences

View menu: Toolbar; Statusbar
Misc menu: Stop current computation; Stop all computations; Buffered output; Word completion; Filename completion; List objects; Remove all objects; List search path

Packages menu: Load package; Set CRAN mirror; Select repositories; Install package(s); Update packages; Install package(s) from local zip files

Windows menu: Cascade; Tile; Arrange Icons; R Console
Help menu: Console; FAQ on R; FAQ on R for Windows; Manuals (in PDF); R functions (text); Search help; search.r-project.org; Apropos; R project home page; About

9. The R Manuals and other documents

A tremendous amount of information on R is freely available on the world wide web. The most authoritative site is the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/. At this site, you will find a variety of R manuals:

An Introduction to R
The R language definition
Writing R Extensions
R Data Import/Export
R Installation and Administration
R Internals
The R Reference Index

In addition, other contributed documentation is available in the following categories:

a) Documents with more than 100 pages:

Using R for Data Analysis and Graphics - Introduction, Examples and Commentary
Simple R
Practical Regression and Anova using R
Web Appendix (An R and S-PLUS Companion to Applied Regression)
An Introduction to S and the Hmisc and Design Libraries
Statistical Computing and Graphics Course Notes
An Introduction to R: Software for Statistical Modelling & Computing
Introduction to the R Project for Statistical Computing for Use at the ITC
Analysis of Epidemiological Data Using R and Epicalc
Statistics using R with Biological Examples
b) Documents with fewer than 100 pages:

R for Beginners
Kickstarting R (version 1.6)
Notes on the use of R for psychology experiments and questionnaires
R for Windows Users (version 2.0)
Building Microsoft Windows Versions of R and R packages under Intel Linux
A Guide for the Unwilling S User
The R language - a short companion
Fitting Distribution with R
Econometrics in R
The Friendly Beginners R Course
An R companion to Experimental Design
The R Guide
Multilevel Modeling in R
Statistics with R and S-Plus

c) Short Documents and Reference Cards:

R reference card
R and Octave
Time series reference card
Regression reference card
R reference card

d) Task Views

The Task Views page on the CRAN website is also a useful tool for finding out the different areas covered by R. Click on Task Views to see the list of areas. Clicking on any area of interest gives a detailed description of that area and of the packages used in it. Currently, 16 views are available.
e) R News

Finally, R News is the newsletter of the R project for statistical computing. It has short to medium length articles covering topics of interest to users or developers of R. It includes:

Changes in R: new features of the latest release
Changes on CRAN: new add-on packages, manuals, binary distributions, mirrors,...
Add-on packages: short introductions to or reviews of R extension packages
Programmer's Niche: nifty hints for programming in R (or S)
Hints for newcomers: explaining sides of R that might not be so obvious from reading the manuals and FAQs
Applications: examples of analyzing data with R

As a medium for communication, the newsletter is intended to fill the gap between the R mailing lists and scientific journals. Compared with emails it is more persistent, its articles can be cited, and because it is edited it has better quality control. Compared with scientific journals, it is faster, less formal and focuses on R.

9.1 Mailing lists

Several mailing lists for R exist. Go to the site http://www.r-project.org/mail.html#instructions to view these lists. R-help is the main one. It is for announcements about the development of R, the availability of new code, questions and answers about problems and solutions using R, documentation of R, and the posting of nice examples and benchmarks, among others. You can subscribe (highly recommended) to this mailing list so that you receive updates about the R software. All you need to do is fill in and send the form found at https://stat.ethz.ch/mailman/listinfo/r-help. Once done, you will be sent an email requesting confirmation, to prevent others from gratuitously subscribing you. It is a hidden list, meaning that the list of members is available only to the list administrator. To post a message to all the list members, send an email to r-help@r-project.org.
10. Useful websites and references

The following is a list of websites from which you can obtain useful information on learning and using the R software.

http://www.nceas.ucsb.edu/scicomp/rshortcourse.html
http://tw.raunvis.hi.is/tutor-web/stats.dep/1500r/lecture20/
http://www.r-project.org/doc/bib/r-publications.html

Book references

There are also excellent books that one could use to learn the R software. A few of these are listed below:

Peter Dalgaard. Introductory Statistics with R. Springer, 2002. ISBN 0-387-95475-9. (Available in ILRI-ICRAF RMG Library)
John Maindonald and John Braun. Data Analysis and Graphics Using R: An Example-Based Approach, 2nd Ed. Cambridge University Press, Cambridge, 2003. ISBN 0-521-81336-0. (Available in ILRI-Nairobi Library)
John Verzani. Using R for Introductory Statistics. Chapman & Hall/CRC, Boca Raton, FL, 2005. ISBN 1-584-88450-9. http://wiener.math.csi.cuny.edu/usingr/
Julian J. Faraway. Linear Models with R. Chapman & Hall/CRC, Boca Raton, FL, 2004. ISBN 1-584-88425-8. http://www.maths.bath.ac.uk/~jjf23/lmr/
Appendix 1: Example on use of R

a) The following data set in Table 1, which is artificial, will be the data used throughout much of this course. It describes an experiment carried out to study the effect of supplementation of weaned lambs on their health and growth rate when exposed to helminthiasis. Sixteen Dorper (breed 1) and 16 Red Maasai (breed 2) lambs were treated with an anthelmintic at 3 months of age (following weaning) and assigned at random, within blocks of 4 per breed ranked on the basis of 3-month body weight, to supplemented and non-supplemented groups. Thus, 2 lambs from each block were assigned at random to each of the supplemented and non-supplemented groups. All lambs grazed on pasture for a further 3 months. At night they were housed, and lambs in the supplemented group were fed cotton cake and bran meal.

Table 1 Data set A (Dorper / Red Maasai supplementation trial) - used throughout these notes

Record   ID  Breed  Sex  Supplement  Block  Weight at  Weight at  PCV    FEC  Weight
                                              3m (kg)    6m (kg)  (%)  (epg)  gain (kg)
     1  349      1    2           1      1        8.0        8.9   10   6500        0.9
     2  326      1    2           1      1        9.0       10.1   11   2650        1.1
     3  393      1    1           1      2       12.0       12.6   22    750        0.6
     4   71      1    1           1      2       12.3       14.6   15   5200        2.3
     5  271      1    1           1      3       13.0       13.7   19   4800        0.7
     6  382      1    2           1      3       15.5       16.8   24   2450        1.3
     7   85      1    2           1      4       16.3       18.2   27    200        1.9
     8  176      1    2           1      4       15.9       17.7   21   3000        1.8
     9  286      1    2           2      1       11.0       13.6   21   1600        2.6
    10  183      1    1           2      1        9.9       11.7   21    450        1.8
    11   21      1    2           2      2       11.6       13.1   25   2900        1.5
    12  122      1    1           2      2       12.5       14.8   25    300        2.3
    13  374      1    1           2      3       14.6       17.9   19   2250        3.3
    14   32      1    2           2      3       14.2       16.9   22   2800        2.7
    15  282      1    2           2      4       16.3       20.2   20    750        3.9
    16   94      1    1           2      4       16.7       17.7   13   5600        1.0
    17  127      2    2           1      1        7.5        8.1   26   1350        0.6
    18  216      2    2           1      1        8.2        9.3   19   1150        1.1
    19  133      2    1           1      2       10.1       11.7   30    200        1.6
    20  249      2    1           1      2        8.8       10.4   28      0        1.6
    21  123      2    2           1      3       11.6       12.6   23    600        1.0
    22  222      2    2           1      3       11.3       13.5   24   1500        2.2
    23  290      2    2           1      4       12.3       14.3   22   1950        2.0
    24  148      2    1           1      4       13.1       14.9   26    500        1.8
    25  142      2    2           2      1        8.2       11.5   25    850        3.3
    26  154      2    2           2      1        9.5       12.2   35    700        3.7
    27  166      2    1           2      2        9.7       12.8   29    400        3.1
    28  322      2    1           2      2        8.6       12.0   26    800        3.4
    29  156      2    1           2      3       10.2       13.0   28   1550        2.8
    30  161      2    2           2      3       11.2       14.6   22    550        3.4
    31  321      2    1           2      4       12.1       15.9   25   1250        3.8
    32  324      2    1           2      4       13.8       18.1   24   1100        4.3

Data recorded included body weight at 3 months of age, and body weight, packed red cell volume (PCV) and faecal egg count (FEC) at 6 months of age. Some of the questions we will attempt to answer are:

Did supplementation improve weight gain?
Did supplementation affect PCV and FEC?
Were there any differences in weight gain, PCV or FEC between breeds?

Importing data into R

Data can be stored in a variety of software programs (e.g. ACCESS, EXCEL, Genstat etc). The best way is to export the data into an ASCII file which can be used in R. From Excel, a commonly used spreadsheet program, the data can be saved in .csv (comma separated values) format. The first row in Excel should contain the variable names, followed by the data. Any extra rows before the row of variable names should be deleted. To read an ASCII file, the following commands can be used:

> mydata <- read.table("c://rcourse/example_a.txt", header = TRUE)
> mydata

Or

> mydata <- read.csv("c://rcourse/example_a.csv", header = TRUE, sep = ",")
> mydata

Another way of getting the data into R is to copy the table in Excel and then use

> mydata <- read.table("clipboard", header = TRUE)
> mydata

The disadvantage of the above command is that you can't save the script to run again, because it depends on whatever is currently on the clipboard. The advantage is that you do not have to type a very long path and file name correctly.

To read an EXCEL or ACCESS file directly, the RODBC package needs to be installed.
The following commands are used to open the worksheet Sheet1 from the Example A Excel file:

> data1 <- "c://rcourse/example A.xls"
> connect1 <- odbcConnectExcel(data1)
> mydata <- sqlFetch(channel = connect1, sqtable = "Sheet1")

To read the ACCESS table Alldata from the Example_A Access database:

> data2 <- "c://rcourse/example_a.mdb"
> connect2 <- odbcConnectAccess(data2)
> mydata <- sqlFetch(channel = connect2, sqtable = "Alldata")

To display the names of the variables in column order of the data frame, type names(mydata):

> names(mydata)
 [1] "ID1"     "Record"  "ID"      "breed"   "sex"     "supp"    "block"
 [8] "wt_3mo"  "wt_6mo"  "pcv"     "fec"     "wt_gain"

To display the characteristics of the variables in mydata, type str(mydata):

> str(mydata)
'data.frame':   32 obs. of  12 variables:
 $ ID1    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Record : num  1 2 3 4 5 6 7 8 9 10 ...
 $ ID     : num  349 326 393 71 271 382 85 176 286 183 ...
 $ breed  : num  1 1 1 1 1 1 1 1 1 1 ...
 $ sex    : num  2 2 1 1 1 2 2 2 2 1 ...
 $ supp   : num  1 1 1 1 1 1 1 1 2 2 ...
 $ block  : num  1 1 2 2 3 3 4 4 1 1 ...
 $ wt_3mo : num  8 9 12 12.3 13 15.5 16.3 15.9 11 9.9 ...
 $ wt_6mo : num  8.9 10.1 12.6 14.6 13.7 ...
 $ pcv    : num  10 11 22 15 19 24 27 21 21 21 ...
 $ fec    : num  6500 2650 750 5200 4800 2450 200 3000 1600 450 ...
 $ wt_gain: num  0.9 1.1 0.6 2.3 0.7 ...

Usually, if a variable is not numeric, R treats it as a factor. Here breed is coded with numbers and has therefore been read as numeric. To transform the numeric variable breed into a factor, type:

> mydata$breed <- as.factor(mydata$breed)

Check again with str(mydata) that it has been converted to a factor. Now change sex, supp and block to factors.
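As a minimal sketch of this exercise, the conversion can be applied to several columns in one step instead of one command per variable. The data frame mini below is a hypothetical miniature built from the first four records of Table 1, not the course file, and the names mini and to_factor are assumptions made for illustration.

```r
# Hypothetical miniature of mydata: first four records of Table 1
mini <- data.frame(
  breed  = c(1, 1, 1, 1),
  sex    = c(2, 2, 1, 1),
  supp   = c(1, 1, 1, 1),
  block  = c(1, 1, 2, 2),
  wt_3mo = c(8.0, 9.0, 12.0, 12.3)
)

# Convert the coded variables to factors in one step
to_factor <- c("sex", "supp", "block")
mini[to_factor] <- lapply(mini[to_factor], as.factor)

str(mini)   # sex, supp and block are now factors; wt_3mo stays numeric
```

The same two lines work on the full data frame by replacing mini with mydata.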
Exporting Data from R

One can also save the data one has been working on to a file. To save mydata as mydata1.txt in the folder c:\rcourse:

> db1 <- "c://rcourse/mydata1.txt"
> write.table(x = mydata, file = db1, quote = FALSE, sep = ";", row.names = FALSE, col.names = TRUE)

To save mydata as a table named t_data in the Access database Example_A:

> db1 <- "c://rcourse/example_a.mdb"
> connect <- odbcConnectAccess(db1)
> sqlSave(channel = connect, dat = mydata, tablename = "t_data", rownames = FALSE)

Data Exploration

Before undertaking any statistical analysis, it is useful to explore the data. To summarize the variables in mydata, type summary(mydata):

> summary(mydata)
     Record            ID            breed          sex
 Min.   : 1.00   Min.   : 21.0   Min.   :1.0   Min.   :1.000
 1st Qu.: 8.75   1st Qu.:131.5   1st Qu.:1.0   1st Qu.:1.000
 Median :16.50   Median :179.5   Median :1.5   Median :2.000
 Mean   :16.50   Mean   :209.4   Mean   :1.5   Mean   :1.531
 3rd Qu.:24.25   3rd Qu.:297.8   3rd Qu.:2.0   3rd Qu.:2.000
 Max.   :32.00   Max.   :393.0   Max.   :2.0   Max.   :2.000
      supp         block          wt_3mo           wt_6mo           pcv
 Min.   :1.0   Min.   :1.00   Min.   : 7.500   Min.   : 8.10   Min.   :10.00
 1st Qu.:1.0   1st Qu.:1.75   1st Qu.: 9.525   1st Qu.:11.93   1st Qu.:20.75
 Mean   :1.5   Mean   :2.50   Mean   :11.688   Mean   :13.86   Mean   :22.72
 3rd Qu.:2.0   3rd Qu.:3.25   3rd Qu.:13.275   3rd Qu.:16.12   3rd Qu.:26.00
 Max.   :2.0   Max.   :4.00   Max.   :16.700   Max.   :20.20   Max.   :35.00
      fec            wt_gain
 Min.   :   0.0   Min.   :0.600
 1st Qu.: 587.5   1st Qu.:1.250
 Median :1200.0   Median :1.950
 Mean   :1770.3   Mean   :2.169
 3rd Qu.:2500.0   3rd Qu.:3.150
 Max.   :6500.0   Max.   :4.300

For numeric variables, summary statistics are shown; for factors, a frequency tabulation is displayed instead. One can also get a summary of an individual variable by typing summary(mydata$wt_3mo):

> summary(mydata$wt_3mo)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  7.500   9.525  11.600  11.690  13.270  16.700
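The difference between the two kinds of summary can be seen on a single variable. The sketch below assumes only the breed codes of Table 1 (16 lambs coded 1 followed by 16 lambs coded 2); breed_num and breed_fac are illustrative names, not objects from the course files.

```r
# Breed codes as in Table 1: 16 Dorper (1) followed by 16 Red Maasai (2)
breed_num <- rep(c(1, 2), each = 16)
summary(breed_num)      # numeric: minimum, quartiles, median, mean, maximum

breed_fac <- as.factor(breed_num)
summary(breed_fac)      # factor: number of lambs at each level
```

This is why coded variables such as breed, sex, supp and block are more informative in summary() once they have been converted to factors.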
Each time you refer to a variable, you need to prefix it with the name of the data frame, i.e. type summary(mydata$wt_3mo). If you run the command attach(mydata), you no longer need to type mydata$ when specifying a variable, i.e. you can type wt_3mo instead of mydata$wt_3mo.

> attach(mydata)

One can also check the means by supp and breed with the aggregate function:

> aggregate(data.frame(wt_gain = wt_gain), by = list(supplement = supp, breed = breed), mean)
  supplement breed wt_gain
           1     1  1.3250
           2     1  2.3875
           1     2  1.4875
           2     2  3.4750

For the aggstat function to work, the ttool package needs to be installed. This package is not available on CRAN but is available from http://forums.cirad.fr/logiciel-r/viewforum.php?f=5

> aggstat(formula = wt_gain ~ supp + breed, data = mydata, FUN = sd)
  supp breed   wt_gain
     1     1 0.6158618
     1     2 0.5436320
     2     1 0.9508455
     2     2 0.4590363

For histogram, densityplot, stripplot, qqmath, xyplot etc. one needs to load the package lattice:

> library(lattice)

To check whether the variables are normally distributed:

> histogram(~ fec, n = 15, xlab = "Faecal egg count")
> histogram(~ pcv, n = 15, xlab = "Packed Cell volume")

[Figure: histogram of packed cell volume; y-axis: Percent of Total]

> boxplot(fec, ylab = "Faecal egg count")

[Figure: boxplot of faecal egg count]

> boxplot(pcv, ylab = "Packed cell volume")
Also try the following:

> densityplot(~fec, data = mydata)
> densityplot(~pcv, data = mydata)
> stripplot(~pcv, data = mydata)
> stripplot(~fec, data = mydata)
> qqmath(~pcv, distribution = qnorm, data = mydata)
> qqnorm(wt_6mo, main = "Normal Q-Q Plot for Wt at 6 months")
> qqline(wt_6mo)

[Figure: Normal Q-Q Plot for Wt at 6 months; x-axis: Theoretical Quantiles, y-axis: Sample Quantiles]

Histogram, boxplot, densityplot, stripplot, qqmath and qqnorm can all be used to check the normality of a variable. In our example, PCV and wt_6mo are approximately normally distributed but FEC is skewed.

To summarize the data by supplement and breed:

> boxplot(fec ~ supp*breed, col = "orange", xlab = "Supp/Breed")
[Figure: boxplots of faecal egg count for the four Supp/Breed groups 1.1, 2.1, 1.2 and 2.2]

> boxplot(wt_gain ~ supp*breed, col = "orange", xlab = "Supp/Breed", ylab = "Weight gain")

[Figure: boxplots of weight gain for the four Supp/Breed groups 1.1, 2.1, 1.2 and 2.2]

To check the relationship between pcv and fec:

> xyplot(pcv ~ fec, xlab = "Faecal egg count")
[Figure: scatterplot of pcv against Faecal egg count]

Data Analysis

T-test

This is used to determine whether there is a statistically significant difference between two means. To select only the records with supp = 2 (supplemented lambs):

> mydata1 <- mydata[mydata$supp == 2, ]
> mydata1

To examine whether, among the supplemented lambs, weight gain differed significantly between the Dorper (breed 1) and the Red Maasai (breed 2):

> t.test(mydata1$wt_gain ~ mydata1$breed)

        Welch Two Sample t-test

data:  wt_gain by breed
t = -2.9132, df = 10.095, p-value = 0.01534
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.9182062 -0.2567938
sample estimates:
mean in group 1 mean in group 2
         2.3875          3.4750

Simple Linear Regression

Let us consider the association between PCV and FEC in our example and ignore, for the time being, that the lambs were of different breeds and fed on different diets. From the scatterplot, one can see that there is a strong association between PCV and FEC.
To do a linear regression, type lm(pcv ~ fec):

> fm1 <- lm(pcv ~ fec)
> fm1

Call:
lm(formula = pcv ~ fec)

Coefficients:
(Intercept)          fec
  26.834685    -0.002325

Now, to get a summary of the fitted model, type summary(fm1):

> summary(fm1)

Call:
lm(formula = pcv ~ fec)

Residuals:
    Min      1Q  Median      3Q     Max
-9.6735 -2.1960  0.2915  1.8324  9.7928

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.834685   0.958305  28.002  < 2e-16 ***
fec         -0.002325   0.000395  -5.887 1.92e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.707 on 30 degrees of freedom
Multiple R-squared: 0.536,      Adjusted R-squared: 0.5205
F-statistic: 34.65 on 1 and 30 DF,  p-value: 1.915e-06

To get the analysis of variance table, type anova(fm1):

> anova(fm1)
Analysis of Variance Table

Response: pcv
          Df Sum Sq Mean Sq F value    Pr(>F)
fec        1 476.20  476.20  34.651 1.915e-06 ***
Residuals 30 412.27   13.74
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To form a new variable group (breed and supp combined) with 4 levels, type:

> mydata$group <- as.numeric(mydata$breed)*10 + as.numeric(mydata$supp)
> mydata
Change the group variable to a factor:

> mydata$group <- as.factor(mydata$group)
> str(mydata)

Once a new variable has been created, the data frame must be attached again so that the new variable can be referred to by name. Type:

> attach(mydata)

To do a one-way analysis of variance:

> fm1a <- aov(wt_gain ~ group)
> anova(fm1a)
Analysis of Variance Table

Response: wt_gain
          Df  Sum Sq Mean Sq F value    Pr(>F)
group      3 23.4413  7.8138  17.464 1.376e-06 ***
Residuals 28 12.5275  0.4474
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To do pairwise comparisons, type TukeyHSD(fm1a):

> TukeyHSD(fm1a)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = wt_gain ~ group)

$group
         diff        lwr        upr     p adj
12-11  1.0625  0.1493641 1.97563587 0.0178725
21-11  0.1625 -0.7506359 1.07563587 0.9616031
22-11  2.1500  1.2368641 3.06313587 0.0000034
21-12 -0.9000 -1.8131359 0.01313587 0.0544896
22-12  1.0875  0.1743641 2.00063587 0.0149255
22-21  1.9875  1.0743641 2.90063587 0.0000122
To do a 2-way ANOVA with interaction:

> fm2 <- aov(wt_gain ~ breed + supp + breed:supp)
> anova(fm2)
Analysis of Variance Table

Response: wt_gain
           Df  Sum Sq Mean Sq F value    Pr(>F)
breed       1  3.1250  3.1250  6.9846   0.01331 *
supp        1 18.6050 18.6050 41.5837 5.522e-07 ***
breed:supp  1  1.7112  1.7112  3.8248   0.06055 .
Residuals  28 12.5275  0.4474
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To fit a simple generalized linear model (with the default gaussian family, as here, this is the same linear model fitted by glm):

> fm3 <- glm(wt_gain ~ breed + supp + breed:supp)
> summary(fm3)

Call:
glm(formula = wt_gain ~ breed + supp + breed:supp)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3875  -0.4406  -0.0500   0.3625   1.5125

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.0250     1.1824   0.867   0.3934
breed        -0.7625     0.7478  -1.020   0.3166
supp          0.1375     0.7478   0.184   0.8554
breed:supp    0.9250     0.4730   1.956   0.0605 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.4474107)

    Null deviance: 35.969  on 31  degrees of freedom
Residual deviance: 12.527  on 28  degrees of freedom
AIC: 70.802

Number of Fisher Scoring iterations: 2
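For readers who do not have the course files at hand, the two-way analysis of variance can be reproduced from the weight gains listed in Table 1. This is only a sketch under stated assumptions: the data frame name lambs is hypothetical, breed and supp are entered as factors, and the model is fitted through the data argument rather than after attach(). The formula breed * supp is shorthand for breed + supp + breed:supp.

```r
# Weight gains (kg) from Table 1, in record order 1 to 32
wt_gain <- c(0.9, 1.1, 0.6, 2.3, 0.7, 1.3, 1.9, 1.8,
             2.6, 1.8, 1.5, 2.3, 3.3, 2.7, 3.9, 1.0,
             0.6, 1.1, 1.6, 1.6, 1.0, 2.2, 2.0, 1.8,
             3.3, 3.7, 3.1, 3.4, 2.8, 3.4, 3.8, 4.3)

# Records 1-16 are breed 1, 17-32 breed 2; within each breed the
# first 8 records have supp = 1 and the next 8 have supp = 2
lambs <- data.frame(
  breed   = factor(rep(c(1, 2), each = 16)),
  supp    = factor(rep(rep(c(1, 2), each = 8), times = 2)),
  wt_gain = wt_gain
)

# Two-way ANOVA with interaction, without attaching the data frame
fit <- aov(wt_gain ~ breed * supp, data = lambs)
tab <- anova(fit)
tab

# Mean weight gain in each supp x breed cell
cell_means <- tapply(lambs$wt_gain,
                     list(supp = lambs$supp, breed = lambs$breed), mean)
cell_means
```

The sums of squares in tab agree with the analysis of variance table shown above (3.1250, 18.6050, 1.7112 and 12.5275), and cell_means reproduces the four group means obtained earlier with aggregate. Passing data = lambs avoids the need to re-attach the data frame whenever a variable is added or changed.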
Q. Repeat the above analysis of variance for PCV and fill in the table below.

Analysis of Variance Table

Response: pcv
           Df Sum Sq Mean Sq F value Pr(>F)
breed       1
supp        1
breed:supp  1
Residuals  28 560.38
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Chi-square test

The example is drawn from a group of 55 Dorper and 42 Red Maasai lambs, of which 14 and 4, respectively, died before weaning. To check whether there is any association between breed and mortality, a Chi-square test can be done. The file Chi-square.xls is provided.

> data1 <- "c://rcourse/chi-square.xls"
> connect1 <- odbcConnectExcel(data1)
> mydata <- sqlFetch(channel = connect1, sqtable = "mortality")
> mydata
      breed No Dead
1    Dorper 55   14
2 Red Masai 42    4

> prop.test(x = mydata$Dead, n = mydata$No)

        2-sample test for equality of proportions with continuity correction

data:  mydata$Dead out of mydata$No
X-squared = 3.0144, df = 1, p-value = 0.08253
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.007064957  0.325679676
sample estimates:
   prop 1    prop 2
0.2545455 0.0952381

To use the chisq.test function, first bind the counts of dead and surviving lambs into a table:

> x <- cbind(mydata$Dead, mydata$No - mydata$Dead)
> x
     [,1] [,2]
[1,]   14   41
[2,]    4   38
> res <- chisq.test(x)
> res

        Pearson's Chi-squared test with Yates' continuity correction

data:  x
X-squared = 3.0144, df = 1, p-value = 0.08253

To get the expected values:

> res$expected
          [,1]     [,2]
[1,] 10.206186 44.79381
[2,]  7.793814 34.20619
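The same test can be run without the Excel file by typing the counts quoted in the text (14 deaths among 55 Dorper lambs and 4 among 42 Red Maasai lambs) directly into R. A minimal sketch follows; the object names dead, total and counts are illustrative only.

```r
# Counts quoted in the text
dead  <- c(Dorper = 14, "Red Masai" = 4)
total <- c(Dorper = 55, "Red Masai" = 42)

# 2 x 2 table of dead and surviving lambs by breed
counts <- cbind(Dead = dead, Alive = total - dead)
counts

# For a 2 x 2 table chisq.test applies Yates' continuity correction by default
res <- chisq.test(counts)
res
res$expected   # expected counts under no association
```

Naming the rows and columns of the table makes the printed output and the expected values easier to read than the unlabelled matrix produced by cbind on bare vectors; the test statistic is the same as above.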
Appendix 2: Introduction to mapping with R

By M. Lesnoff, CIRAD, France (matthieu.lesnoff@cirad.fr)

This note presents a few examples of mapping with R from shape files. For more information, see http://r-spatial.sourceforge.net.

We use the package maptools for importing shape files and the package sp for creating and plotting the maps (both available on CRAN, http://cran.at.r-project.org). These packages must be installed before running the examples. The shape files used for the examples are:

African countries.dat
African countries.dbf
African countries.id
African countries.ind
African countries.map
African countries.shp
African countries.shx
African countries.tab

We load maptools (this automatically loads foreign and sp, which are required by maptools):

library(maptools)
Loading required package: foreign
Loading required package: sp

We import the shape files with the function readShapePoly of maptools and store the result in the object mydata:

mydata <- readShapePoly("C:/Rcourse/African countries.shp")

The function readShapePoly creates an S4 object. The components of an S4 object are called slots and can be listed with the R function getSlots:

getSlots("SpatialPolygonsDataFrame")
        data     polygons    plotOrder         bbox  proj4string
"data.frame"       "list"    "integer"     "matrix"        "CRS"

The slots are extracted with the operator @ (instead of $ in S3). Auxiliary information is in the slot data:

str(mydata@data)
'data.frame':   55 obs. of  27 variables:
 $ PAYS      : Factor w/ 55 levels "Afrique du sud",..: 2 3 4 5 6 7 8 9 ...
 $ CAPITALE  : Factor w/ 54 levels "Abidjan","Accra",..: 4 29 44 18 42 ...
 $ CONTINENT : Factor w/ 1 level "Afrique": 1 1 1 1 1 1 1 1 1 1 ...
 $ POP_94    : num  22600957 4830449 4304000 1326796 9190791 ...
 $ POP_EVOL  : num  2.5 2.7 3.3 2.7 3.1 3.2 2.7 3 2.6 2.1 ...
 $ POP_HOM   : num  11425492 2459015 2086000 634400 4492153 ...
 $ POP_FEM   : num  11175465 2371434 2218000 692396 4698638 ...
 $ POP_0_14  : num  9946100 2011378 2005000 567470 4501196 ...
 $ POP_15_64 : num  11758841 2689498 2193000 670769 4566206 ...
 $ POP_65PLUS: num  893159 124757 106000 62561 0 ...
 $ HOM_0_14  : num  5087856 1036710 992000 280437 2276272 ...
 $ HOM_15_64 : num  5894449 1351623 1045000 312098 2174784 ...
 $ HOM_65PLUS: num  441519 68244 49000 27904 0 ...
 $ FEM_0_14  : num  4858244 974668 1013000 287033 2224924 ...
 $ FEM_15_64 : num  5864392 1337875 1148000 358671 2391422 ...
 $ FEM_65PLUS: num  451640 56513 57000 34657 0 ...
 $ POP_URBAN : num  0 0 1386000 0 1287285 ...
 $ POP_RURAL : num  0 0 2918000 0 7903506 ...
 $ URB_HOM   : num  0 0 701000 0 642285 ...
 $ URB_FEM   : num  0 0 685000 0 645000 ...
 $ RUR_HOM   : num  0 0 1385000 0 3849868 ...
 $ RUR_FEM   : num  0 0 1533000 0 4053638 ...
 $ TERRE_CULT: num  3 2 12 2 10 43 13 9 3 2 ...
 $ TX_ALPHA  : num  50 42 23 23 18 30 54 66 27 30 ...
 $ TX_INFLAT : num  20 23.2 4.3 12 -0.5 11.7 8.6 8.2 -4.2 -4.9 ...
 $ TX_CHOMAGE: num  26 0 0 25 0 0 18 25 30 0 ...
 $ CR_INDUST : num  0.9 0 -0.7 16.8 0 5.1 -6.4 18 0.8 12.9 ...
 - attr(*, "data_types")= chr  "C" "C" "C" "N" ...

We map the population growth rate (variable POP_EVOL):

# define the colors of the palette that will be used
mypal <- colorRampPalette(c("yellow", "red"))

# define the arrow to print in the map
arrow <- list("SpatialPolygonsRescale", layout.north.arrow(),
              offset = c(-1.8e+06, -3e+06), scale = 0.7e+06)

# build and plot the map
spplot(
  obj = mydata,
  zcol = c("POP_EVOL"),
  regions = TRUE,
  col.regions = mypal(20),
  #col.regions = gray(seq(1, 0, by = -0.05)),
  colorkey = TRUE,
  scales = list(draw = TRUE),
  sp.layout = list(arrow),
  xlab = "Longitude", ylab = "Latitude"
)

[Figure: map of Africa shaded from yellow to red by POP_EVOL, with colour key; axes: Longitude, Latitude]

Many other colors and palettes can be used (see ?colors). For instance:
spplot(
  obj = mydata,
  zcol = c("POP_EVOL"),
  regions = TRUE,
  col.regions = gray(seq(1, 0, by = -0.05)),
  colorkey = TRUE,
  scales = list(draw = TRUE),
  sp.layout = list(arrow),
  xlab = "Longitude", ylab = "Latitude"
)

[Figure: map of Africa shaded in grey levels by POP_EVOL, with colour key; axes: Longitude, Latitude]

We create a map that highlights Ethiopia:

mydata@data$selec <- ifelse(mydata@data$PAYS == "Ethiopie", 1, 2)

spplot(
  obj = mydata,
  zcol = c("selec"),
  regions = TRUE,
  col.regions = c("red", "lightgrey"),
  colorkey = FALSE,
  scales = list(draw = TRUE),
  sp.layout = list(arrow),
  xlab = "Longitude", ylab = "Latitude"
)

[Figure: map of Africa with Ethiopia highlighted in red; axes: Longitude, Latitude]
The same map, but also highlighting Somalia:

mydata@data$selec <- ifelse(mydata@data$PAYS == "Ethiopie", 1,
                            ifelse(mydata@data$PAYS == "Somalie", 2, 3))

spplot(
  obj = mydata,
  zcol = c("selec"),
  regions = TRUE,
  col.regions = c("red", "lightyellow4", "lightgrey"),
  at = c(0.5, 1.5, 2.5, 3.5),
  colorkey = FALSE,
  scales = list(draw = TRUE),
  sp.layout = list(arrow),
  xlab = "Longitude", ylab = "Latitude"
)

[Figure: map of Africa with Ethiopia and Somalia highlighted; axes: Longitude, Latitude]
The function spplot can also produce trellis maps. For example, we map the rural and urban population sizes side by side (variables POP_RURAL and POP_URBAN); note that the same scale is used for both variables:

spplot(
  obj = mydata,
  zcol = c("POP_RURAL", "POP_URBAN"),
  names.attr = c("Rural", "Urban"),
  colorkey = TRUE,
  col.regions = mypal(20),
  scales = list(draw = TRUE),
  sp.layout = list(arrow),
  xlab = "Longitude", ylab = "Latitude"
)

[Figure: two-panel map of Africa, panels Rural and Urban, shaded by population size with a common colour key; axes: Longitude, Latitude]
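The colour arguments passed to col.regions in the maps above are ordinary character vectors of colours, so they can be inspected without any map. The sketch below needs only base R; it shows that colorRampPalette returns a function which, when called with a number n, interpolates n colours between the ones given. The names cols and greys are illustrative.

```r
# colorRampPalette returns a function that generates n interpolated colours
mypal <- colorRampPalette(c("yellow", "red"))
cols <- mypal(5)
cols                  # five hexadecimal colour codes running from yellow to red

# The grey scale used as the alternative palette: from white (1) to black (0)
greys <- gray(seq(1, 0, by = -0.05))
length(greys)         # number of grey levels generated
```

Increasing the number passed to mypal gives a smoother colour gradient in the map; 20 levels were used above.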
Appendix 3: Few tips on R configuration

By M. Lesnoff, CIRAD, France (matthieu.lesnoff@cirad.fr)

The R environment can be configured in many ways. We present only a few of them, related to the text files Rprofile.site and Rconsole located in the folder etc under the R root. The user can open and modify these files to specify the required configuration. R must be restarted after any modification.

a) File Rprofile.site

Example of contents:

# Things you might want to change
# options(papersize="a4")
# options(editor="notepad")
# options(pager="internal")

# to prefer Compiled HTML help
options(chmhelp=TRUE)
# to prefer HTML help
# options(htmlhelp=TRUE)

.First <- function() {
  cat("Welcome to R! \n")
  library(utils)   # required for loading ttool
  library(lattice)
  library(ttool)
  trellis.device(color = 0)
}

The command options allows the user to set and examine a variety of global options which affect the way in which R computes and displays its results. For instance, the command options(chmhelp=TRUE) means that the R help will use the compiled HTML format files located in the folder chtml of each library folder (e.g., C:\R-2.6.2\library\base\chtml).
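Options can also be examined and changed from the console during a session, which is a convenient way to try a setting before writing it into Rprofile.site. The following is a small sketch using the standard option digits (the number of significant digits printed); the names before, old and shown are illustrative.

```r
before <- getOption("digits")   # examine the current value of an option

old <- options(digits = 4)      # set it; options() returns the previous settings
shown <- format(pi)             # pi formatted with 4 significant digits
shown

options(old)                    # restore the previous settings
getOption("digits")             # back to the value stored in 'before'
```

Keeping the value returned by options() makes it easy to undo a change, whereas a setting written in Rprofile.site is applied again every time R starts.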
Try:

# options(chmhelp=TRUE)
options(htmlhelp=TRUE)

and then:

options(chmhelp=TRUE)
# options(htmlhelp=TRUE)

The function .First allows the user to specify the commands that R will run automatically each time it is started. In the previous example, R will display a welcome message, load the packages utils, lattice and ttool, and open a black-and-white trellis graphics window.

b) File Rconsole

Example of contents:

# Optional parameters for the console and the pager
# The system-wide copy is in R_HOME/etc.
# A user copy can be installed in `R_USER'.

## Style
# This can be `yes' (for MDI) or `no' (for SDI).
MDI = yes

The setting MDI = yes (R environment in MDI mode) means that R will display the windows R Console, R Graphics and R Editor inside a single main window (this is the default when R is installed). If the user specifies MDI = no (R environment in SDI mode), R will display each of the windows R Console, R Graphics and R Editor as a separate, independent window. This configuration can be done either by modifying the text file Rconsole or by using the menu Edit/GUI preferences.