Using R for Data Analysis and Graphics

1. Introduction

1.1 What is R?

R is a software environment for statistical computing.

- R is based on commands and implements the S language.
- There is an unofficial menu-based interface called R-Commander.
- Drawback of menus: it is difficult to store what you do. A script of commands documents the analysis and allows for easy repetition with changed data, options, ...
- R is free software: http://www.r-project.org
- Supported operating systems: Linux, Mac OS X, Windows.
- R serves as a language for exchanging statistical methods among researchers.

1.2 Other Statistical Software

- S-Plus: same programming language, commercial. Features a GUI.
- SPSS: good for standard procedures.
- SAS: all-rounder, good for large data sets and complicated analyses.
- Systat: analysis of variance, easy-to-use graphics system.
- Excel: very limited collection of statistical methods. Good for getting the dataset ready.
- Matlab: mathematical methods; statistical methods limited. Similar paradigm, less flexible structure.

1.3 Introductory examples

A dataset that we have stored before in the system is called d.sport:

            weit  kugel  hoch   disc  stab  speer  punkte
  OBRIEN    7.57  15.66   207  48.78   500  66.90    8824
  BUSEMANN  8.07  13.60   204  45.04   480  66.86    8706
  DVORAK    7.60  15.82   198  46.28   470  70.16    8664
  :            :      :     :      :     :      :       :
  CHMARA    7.75  14.51   210  42.60   490  54.84    8249

Draw a histogram of the results of variable kugel! We type

  hist(d.sport[,"kugel"])

The graphics window is opened automatically. We have called the S function hist with the argument d.sport[,"kugel"]. The notation [,] is used to select the column.

Scatter plot: type

  plot(d.sport[,"kugel"], d.sport[,"speer"])

First argument: x coordinates; second: y coordinates. There are many optional arguments!

  plot(d.sport[,"kugel"], d.sport[,"speer"],
       xlab="ball push", ylab="javelin", pch=7)

Scatter plot matrix:

  pairs(d.sport)

Every column of d.sport is plotted against all other columns.
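The optional arguments of plot() can be tried out without the full dataset. The following is a minimal sketch that uses only the four athletes visible in the table excerpt and writes the plots to a temporary PDF file instead of the graphics window; the names t.kugel, t.speer and t.pdf are assumptions made for this illustration.

```r
# Four rows of d.sport as shown in the table excerpt (kugel and speer).
t.kugel <- c(15.66, 13.60, 15.82, 14.51)
t.speer <- c(66.90, 66.86, 70.16, 54.84)

# t.pdf is a hypothetical output file; pdf() opens a file graphics device.
t.pdf <- tempfile(fileext = ".pdf")
pdf(t.pdf)
plot(t.kugel, t.speer,
     xlab = "kugel", ylab = "speer",   # axis labels
     pch = 7,                          # plotting symbol
     main = "Excerpt of d.sport")      # plot title
hist(t.kugel, main = "Histogram of kugel", xlab = "kugel")
invisible(dev.off())                   # close the device so the file is complete
```

Each high-level plotting call starts a new page in the PDF, so the file ends up with one scatter plot and one histogram.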

Get a dataset from a text file and assign it to a name:

  d.sport <- read.table(
    "http://stat.ethz.ch/teaching/datasets/WBL/sport.dat",
    header=TRUE)

Start the file browser of the operating system to get a file:

  d.sport <- read.table(file...())
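Because the course URL may not be reachable, here is a self-contained sketch of the same read.table() step. It writes three rows of the table excerpt to a temporary file and reads them back; the names t.file and d.demo are hypothetical and used only here.

```r
# Write a small excerpt of the decathlon table to a temporary text file.
t.file <- tempfile(fileext = ".dat")
writeLines(c(
  "weit kugel hoch disc stab speer punkte",
  "OBRIEN 7.57 15.66 207 48.78 500 66.90 8824",
  "BUSEMANN 8.07 13.60 204 45.04 480 66.86 8706",
  "DVORAK 7.60 15.82 198 46.28 470 70.16 8664"
), t.file)

# header=TRUE: the first line holds the column names. That line has one
# field fewer than the data lines, so read.table() uses the first column
# as row names.
d.demo <- read.table(t.file, header = TRUE)
dim(d.demo)
rownames(d.demo)
d.demo[, "kugel"]
```

The same call works with a URL or a local path in place of t.file.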

1.4 Using R

Within a window running R, you will see the prompt ">". You type a command and get a result and a new prompt.

  > hist(d.sport[,"kugel"])
  >

An incomplete statement can be continued on the next line:

  > plot(d.sport[,"kugel"],
  +      d.sport[,"speer"])

R stores objects in your workspace:

  > d.sport <- read.table(...)

Objects have names like a, fun, d.sport. R provides a huge number of functions and other objects.

An R statement is one of the following:

- the name of an object; the object is displayed:
    > d.sport
- a call to a function; this gives a graphical or numerical result:
    > hist(d.sport[,"kugel"])
- an assignment:
    > a <- 2*pi/360
    > mn <- mean(d.sport[,"kugel"])
  The second assignment stores the mean of d.sport[,"kugel"] under the name mn.
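The three kinds of statement can be tried in a few lines. This is a small sketch built around the assignment a <- 2*pi/360 from above, which is the size of one degree in radians; the name s is an assumption introduced for the example.

```r
a <- 2 * pi / 360        # assignment: one degree expressed in radians
a                        # name of an object: its value is displayed
s <- sin(30 * a)         # function call whose result is stored under the name s
s                        # sine of 30 degrees
objects()                # lists the names in the workspace, including a and s
```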

Some special and useful functions (more details later):

- Documentation on the arguments etc. of a function (or of a dataset provided by the system):
    > help(hist)
  or
    > ?hist
- List all objects (names) in the workspace:
    > objects()
- Leave the R session:
    > q()
  You get the question "Save workspace image? [y/n/c]:". If you answer y, your objects will be available in your next session.

1.5 Scripts and Editors

Instead of typing commands into the R window, you can write commands in an editor and then send them to the R window ... and later modify (correct) them and send them again.

Text editors supporting R:

  WinEdt:  http://www.winedt.com/
  Emacs:   http://www.gnu.org/software/emacs/
  ESS:     http://stat.ethz.ch/ess/
  Tinn-R:  http://www.sciviews.org/tinn-r/

[Figure: the Tinn-R window]

Define Tinn-R keyboard shortcuts: use the dialog R / Hotkeys of R.

2. Basics

2.1 Vectors

Functions and operations are usually applied to whole collections instead of single numbers, including vectors, matrices, and data.frames (such as d.sport).

Numbers can be combined into vectors using the function c() ("combine"):

  > t.v <- c(4,2,7,8,2)
  > t.a <- c(3.1, 5, -0.7, 0.9, 1.7)
  > t.u <- c(t.v,t.a)
  > t.u

Generate a sequence of consecutive integers:

  > seq(1, 9)
  [1] 1 2 3 4 5 6 7 8 9

Since sequences of integers are needed very often, this can be abbreviated to 1:9.

Equally spaced numbers: use the argument by (default: 1)

  > seq(0, 3, by=0.5)
  [1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0

Repetition:

  > rep(0.7, 5)
  [1] 0.7 0.7 0.7 0.7 0.7
  > rep(c(1, 3, 5), length=8)
  [1] 1 3 5 1 3 5 1 3
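A short sketch of a few further variants of seq() and rep(), added for illustration. The argument length=8 used above is a partial match for the full argument name length.out; the object names s1 to r3 are assumptions.

```r
s1 <- seq(1, 9)                    # same values as 1:9
s2 <- seq(0, 3, by = 0.5)          # fixed step width
s3 <- seq(0, 1, length.out = 5)    # fixed number of points instead of a step

r1 <- rep(c(1, 3, 5), length.out = 8)  # repeat until 8 elements are reached
r2 <- rep(c(1, 3, 5), each = 2)        # repeat every element twice in place
r3 <- rep(c(1, 3, 5), times = 2)       # repeat the whole vector twice

s3
r2
r3
```

The difference between each and times only shows in the order of the result: r2 keeps equal values together, r3 cycles through the vector.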

Basic functions for vectors:

  Call, Example   Description
  length(t.v)     Length of a vector, number of elements
  sum(t.v)        Sum of all elements
  mean(t.v)       Arithmetic mean
  var(t.v)        Empirical variance
  range(t.v)      Range
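To make the table concrete, the following sketch recomputes the mean and the empirical variance of t.v by hand and compares them with the built-in functions. var() divides by n - 1, not by n. The names n, m and v are assumptions introduced for the example.

```r
t.v <- c(4, 2, 7, 8, 2)

n <- length(t.v)                  # number of elements
m <- sum(t.v) / n                 # arithmetic mean by hand
v <- sum((t.v - m)^2) / (n - 1)   # empirical variance with divisor n - 1

c(n = n, mean = m, var = v)
all.equal(m, mean(t.v))           # agrees with mean()
all.equal(v, var(t.v))            # agrees with var()
range(t.v)                        # smallest and largest element
```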

2.2 Arithmetic

Simple arithmetic is as expected:

  > 2+5
  [1] 7

Operations: + - * / ^ (exponentiation)

These operations are applied to vectors elementwise.

  > (2:5) ^ c(2,3,1,0)
  [1]  4 27  4  1

Priorities as usual. Use parentheses!

  > (2:5) ^ 2
  [1]  4  9 16 25

Elements are recycled:

  > (1:6)*(1:2)
  [1]  1  4  3  8  5 12
  > (1:5)-(0:1)
  [1] 1 1 3 3 5
  Warning message:
  longer object length
    is not a multiple of shorter object length in: (1:5) - (0:1)
  > (1:6)-(0:1)
  [1] 1 1 3 3 5 5

Be careful, there is no warning in this case!
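Recycling can be made explicit with rep(), which shows what R does silently when the longer length is a multiple of the shorter one. The helper strict.minus is a hypothetical function, not part of R, sketched here as one way to guard against unintended recycling.

```r
x <- 1:6
y <- 0:1

# What recycling does behind the scenes: y is repeated to the length of x.
y.full <- rep(y, length.out = length(x))
y.full
identical(x - y, x - y.full)

# Hypothetical guard: allow only equal lengths or a single number.
strict.minus <- function(a, b) {
  stopifnot(length(a) == length(b) || length(a) == 1 || length(b) == 1)
  a - b
}
strict.minus(x, 1)                        # fine: scalar operand
res <- try(strict.minus(x, y), silent = TRUE)
inherits(res, "try-error")                # lengths 6 and 2 are rejected
```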

2.3 Character Vectors

Character strings: "abc", "nut 999"

Combine strings into a vector of mode character:

  > t.names <- c("Urs", "Anna", "Max", "Pia")

Length of strings:

  > nchar(t.names)
  [1] 3 4 3 3

String manipulations:

  > substring(t.names,3,4)
  [1] "s"  "na" "x"  "a"
  > paste(t.names,"Z.")
  [1] "Urs Z."  "Anna Z." "Max Z."  "Pia Z."
  > paste("X",1:3, sep="")
  [1] "X1" "X2" "X3"
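A few more string operations in the same spirit, as a brief sketch. All functions are from base R; the result names t.init, t.list and t.lab are assumptions for the example.

```r
t.names <- c("Urs", "Anna", "Max", "Pia")

t.init <- substring(t.names, 1, 1)        # first letter of every name
t.list <- paste(t.names, collapse = ", ") # join all elements into one string
t.lab  <- paste0("X", 1:3)                # paste0() is paste() with sep=""

nchar(t.names)
t.init
t.list
t.lab
toupper(t.names)                          # convert to upper case
```

The argument sep goes between the pasted pieces of each element, whereas collapse joins the elements of the result into a single string.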

2.4 Logical Vectors

Logical vectors contain elements TRUE or FALSE:

  > rep(c(TRUE, FALSE), length=6)
  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE

They often result from comparisons: <  <=  >  >=  ==  !=

  > (1:5)>=3
  [1] FALSE FALSE  TRUE  TRUE  TRUE

Logical operations: & (and), | (or), ! (not).

  > t.i <- (t.a>2)&(t.a<5)
  > t.i
  [1]  TRUE FALSE FALSE FALSE FALSE
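Logical vectors are often summarized rather than printed. The sketch below reuses t.a from section 2.1 and shows the or-operator together with sum(), which(), any() and all(); the name t.j is an assumption.

```r
t.a <- c(3.1, 5, -0.7, 0.9, 1.7)

t.i <- (t.a > 2) & (t.a < 5)    # both conditions must hold
t.j <- (t.a < 0) | (t.a >= 5)   # at least one condition must hold

sum(t.i)       # TRUE counts as 1, so this is the number of TRUE elements
which(t.j)     # positions of the TRUE elements
any(t.a < 0)   # is at least one element negative?
all(t.a < 10)  # are all elements below 10?
!t.i           # negation, element by element
```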

2.5 Selecting elements

Select elements from vectors or data.frames with [ ] and [,]. Here t.v is taken to hold the column d.sport[,"kugel"]:

  > t.v <- d.sport[,"kugel"]
  > t.v[c(1,3,5)]
  [1] 15.66 15.82 16.32
  > d.sport[c(1,3,5),1:3]
             weit kugel hoch
  OBRIEN     7.57 15.66  207
  DVORAK     7.60 15.82  198
  HAMALAINEN 7.48 16.32  198

For data.frames, use names of columns or rows:

  > d.sport[c("OBRIEN","DVORAK"),
  +         c("kugel","speer","punkte")]
         kugel speer punkte
  OBRIEN 15.66 66.90   8824
  DVORAK 15.82 70.16   8664

Using logical vectors:

  > t.a[c(TRUE,FALSE,TRUE,TRUE,FALSE)]
  [1]  3.1 -0.7  0.9
  > d.sport[d.sport[,"kugel"] > 16, c(2,7)]
             kugel punkte
  HAMALAINEN 16.32   8613
  PENALVER   16.91   8307
  SMITH      16.97   8271
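The selection rules can be tried on a small stand-in for d.sport. This sketch builds a data.frame from the four athletes and three columns visible in the table excerpt of section 1.3; the names d.demo and t.sel are assumptions.

```r
# Stand-in for d.sport with four rows and three columns from the excerpt.
d.demo <- data.frame(
  kugel  = c(15.66, 13.60, 15.82, 14.51),
  speer  = c(66.90, 66.86, 70.16, 54.84),
  punkte = c(8824, 8706, 8664, 8249),
  row.names = c("OBRIEN", "BUSEMANN", "DVORAK", "CHMARA")
)

t.sel <- d.demo[, "kugel"] > 15          # logical vector, one entry per row
d.demo[t.sel, c("kugel", "punkte")]      # rows by condition, columns by name
d.demo[c("OBRIEN", "CHMARA"), "speer"]   # rows by name, a single column
d.demo[-1, ]                             # negative index: drop the first row
```

Selecting a single column returns a plain vector, while selecting several columns returns a data.frame.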

2.6 Matrices

Matrices are data tables like data.frames, but they can only contain data of a single type (numeric or character).

Generate a matrix:

  > t.m1 <- matrix(1:10, nrow=2, ncol=5)
  > t.m1
       [,1] [,2] [,3] [,4] [,5]
  [1,]    1    3    5    7    9
  [2,]    2    4    6    8   10
  > t.m2 <- matrix(1:10, ncol=2,
  +                byrow=TRUE)

Transpose: t(t.m1) equals t.m2.

Selection of elements as with data.frames:

  > t.m1[2,1:3]
  [1] 2 4 6

Matrix multiplication:

  > t.m1 %*% t.m2
       [,1] [,2]
  [1,]  165  190
  [2,]  190  220

Vectors are treated as 1-row or 1-column matrices (mostly). Functions for linear algebra are available.
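The matrix statements can be checked directly. The following sketch verifies the transpose relation, recomputes one element of the product by hand, and uses solve() as one example of the linear algebra functions; the names t.p and t.inv are assumptions.

```r
t.m1 <- matrix(1:10, nrow = 2, ncol = 5)      # filled column by column
t.m2 <- matrix(1:10, ncol = 2, byrow = TRUE)  # filled row by row

all(t(t.m1) == t.m2)           # the transpose of t.m1 equals t.m2

t.p <- t.m1 %*% t.m2           # (2 x 5) times (5 x 2) gives a 2 x 2 matrix
dim(t.p)
t.p

# Element [1,1] of the product: row 1 of t.m1 times column 1 of t.m2.
sum(t.m1[1, ] * t.m2[, 1])

t.inv <- solve(t.p)            # matrix inverse
round(t.inv %*% t.p, 10)       # the identity matrix up to rounding error
```

Note that * multiplies elementwise, whereas %*% is the matrix product.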

3. Simple Statistics

3.1 Simple Statistical Functions

Count the number of cases with the same value:

  > table(d.blast[,"loc"])
  L1 L2 L3 L4 L5 L6
  14 10 14 10 24 24

Cross-table:

  > table(d.blast[,"loc"],
  +       d.blast[,"loading"])
       2.08 2.18 2.5 2.6 3.12 3.33 3.64
    L1    2    2   1   5    1    2    1
    L2    2    0   0   4    3    1    0
    ...

Estimation of a location parameter: mean(x), median(x)

Variance: var(x)

Correlation:

  > cor(d.sport[,"kugel"], d.sport[,"speer"])

Correlation matrix:

  > t.cor <- cor(d.sport[,1:3])
  > round(100*t.cor)
        weit kugel hoch
  weit   100   -63   34
  kugel  -63   100   -9
  hoch    34    -9  100
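As a sketch of what cor() computes, the correlation can be rebuilt from sums of squares and cross products. Only the four rows of the table excerpt are used here, so the value differs from the one for the full dataset; the names x, y, r.hand and t.cor4 are assumptions.

```r
x <- c(15.66, 13.60, 15.82, 14.51)   # kugel, four rows of the excerpt
y <- c(66.90, 66.86, 70.16, 54.84)   # speer, same rows

# Pearson correlation by hand: cross products over the root of the
# product of the two sums of squares.
dx <- x - mean(x)
dy <- y - mean(y)
r.hand <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r.hand, cor(x, y))         # agrees with the built-in function

# Correlation matrix in percent, as in the round(100*t.cor) idiom above.
t.cor4 <- cor(cbind(kugel = x, speer = y))
round(100 * t.cor4)
median(x)                            # a robust alternative to mean(x)
```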

3.2 Hypothesis Tests

Do two groups differ in their location? Wilcoxon's rank sum test:

  > t.y1 <- sleep[sleep[,"group"]==1,"extra"]
  > t.y2 <- sleep[sleep[,"group"]==2,"extra"]
  > wilcox.test(t.y1, t.y2, paired=FALSE)

          Wilcoxon rank sum test with continuity correction
  data:  t.y1 and t.y2
  W = 25.5, p-value = 0.06933
  alternative hyp.: true location shift not equal to 0

More well-known: the t-test. It assumes normal distributions.

  > t.test(t.y1,t.y2,alternative="two.sided",
  +        paired=FALSE)

          Welch Two Sample t-test
  data:  t.y1 and t.y2
  t = -1.8608, df = 17.776, p-value = 0.0794
  alternative hyp.: true diff. in means not equal to 0
  95 percent confidence interval:
   -3.365  0.205
  sample estimates:
  mean of x mean of y
       0.75      2.33

Note the confidence interval!
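Both tests return objects whose components can be extracted instead of reading them off the printout. The sketch below uses the sleep dataset that ships with R; exact=FALSE is set explicitly as an assumption of this example, because the data contain ties and the normal approximation is used in that case anyway. The names r.t and r.w are hypothetical.

```r
t.y1 <- sleep[sleep[, "group"] == 1, "extra"]
t.y2 <- sleep[sleep[, "group"] == 2, "extra"]

# Welch two-sample t-test (unequal variances is the default).
r.t <- t.test(t.y1, t.y2, alternative = "two.sided", paired = FALSE)

# Wilcoxon rank sum test with normal approximation and continuity correction.
r.w <- wilcox.test(t.y1, t.y2, paired = FALSE, exact = FALSE)

r.t$statistic   # t value
r.t$p.value     # p-value of the t-test
r.t$conf.int    # 95 percent confidence interval for the difference in means
r.t$estimate    # the two group means
r.w$p.value     # p-value of the Wilcoxon test
```

The confidence interval contains 0, which matches a p-value above 0.05.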

3.3 Two Groups

Plots for two samples of data:

  > boxplot(t.y1,t.y2,ylab="extra")
  > plot(sleep[,"group"],sleep[,"extra"],
  +      xlab="group", ylab="extra")

3.4 Statistical Models, Formula Objects

Statistics is concerned with relations between variables. Prototype: the relationship between a target variable Y and explanatory variables X1, X2, ... (regression).

Symbolic notation of such a relation:

  Y ~ X1 + X2

This symbolic notation is an S object (of class formula). (The notation is also used in other statistical packages.)

Use of a formula:

  > plot(punkte ~ kugel + speer,
  +      data = d.sport)

gives 2 scatter plots, punkte (vertical axis) against kugel and speer, respectively (horizontal axis).
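A formula is an ordinary object that can be stored, inspected and passed to a model-fitting function. This sketch uses the four-row excerpt of d.sport as stand-in data, so the fitted coefficients are not meaningful; lm() is used as one example of a function that accepts a formula, and the names t.f, d.demo and r.lm are assumptions.

```r
t.f <- punkte ~ kugel + speer   # target on the left, explanatory on the right
class(t.f)                      # "formula"
all.vars(t.f)                   # names of the variables in the formula

# Stand-in data: four rows of the table excerpt.
d.demo <- data.frame(
  kugel  = c(15.66, 13.60, 15.82, 14.51),
  speer  = c(66.90, 66.86, 70.16, 54.84),
  punkte = c(8824, 8706, 8664, 8249)
)

# The variables named in the formula are looked up in the data.frame.
r.lm <- lm(t.f, data = d.demo)
coef(r.lm)                      # intercept and one slope per explanatory variable
```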

Grouping (nominal, categorical) variables are, e.g., location, type, group, species, plot, ...

Their role in models is different from that of continuous variables, so S must know about them: it stores them as factors.

Character variables enter a data.frame as factors (this is the default in R versions before 4.0.0). A grouping variable with numerical labels can be declared as a factor:

  > sleep[,"group"] <-
  +   factor(sleep[,"group"])
  > plot(extra ~ group, data = sleep)

produces two box plots.
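The difference between numerical labels and a factor can be seen in a small sketch. The vector t.g is made up for illustration; the group means at the end use the sleep dataset that ships with R, where group is already a factor. The names t.g, t.fac and t.means are assumptions.

```r
t.g <- c(1, 1, 2, 2, 2)        # group labels stored as numbers
t.fac <- factor(t.g)           # declared as a grouping variable

is.numeric(t.g)                # TRUE: would be treated as a continuous variable
is.factor(t.fac)               # TRUE: treated as categories
levels(t.fac)                  # the distinct labels, stored as character strings
table(t.fac)                   # number of cases per level

# Summaries per group: mean of extra for each level of group.
t.means <- tapply(sleep[, "extra"], sleep[, "group"], mean)
t.means
```

The two group means are the values reported as sample estimates by the t-test in section 3.2.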