R Language Fundamentals



Similar documents
R Language Fundamentals

Reading and writing files

rm(list=ls()) library(sqldf) system.time({large = read.csv.sql("large.csv")}) # seconds, 4.23GB of memory used by R

A Primer on Using PROC IML

Introduction to the data.table package in R

Using R for Windows and Macintosh

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011

Matrix Multiplication

MATRIX ALGEBRA AND SYSTEMS OF EQUATIONS. + + x 2. x n. a 11 a 12 a 1n b 1 a 21 a 22 a 2n b 2 a 31 a 32 a 3n b 3. a m1 a m2 a mn b m

Typical Linear Equation Set and Corresponding Matrices

Introduction to Matrix Algebra

Creating Basic Excel Formulas

Operation Count; Numerical Linear Algebra

Basics of using the R software

Getting your data into R

Introduction to R June 2006

Introduction to Programming (in C++) Multi-dimensional vectors. Jordi Cortadella, Ricard Gavaldà, Fernando Orejas Dept. of Computer Science, UPC

8 Square matrices continued: Determinants

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Package retrosheet. April 13, 2015

Name: Class: Date: 9. The compiler ignores all comments they are there strictly for the convenience of anyone reading the program.

Commission Formula. Value If True Parameter Value If False Parameter. Logical Test Parameter

I PUC - Computer Science. Practical s Syllabus. Contents

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE

In This Issue: Excel Sorting with Text and Numbers

Sources: On the Web: Slides will be available on:

ECONOMICS 351* -- Stata 10 Tutorial 2. Stata 10 Tutorial 2

Package dsmodellingclient

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Beginner s Matlab Tutorial

AP Computer Science Java Mr. Clausen Program 9A, 9B

Zachary Monaco Georgia College Olympic Coloring: Go For The Gold

Package dsstatsclient

Exploratory data analysis (Chapter 2) Fall 2011

The Undergraduate Guide to R

EXST SAS Lab Lab #4: Data input and dataset modifications

Microsoft Excel 2010 Part 3: Advanced Excel

1 Introduction to Matrices

13 MATH FACTS a = The elements of a vector have a graphical interpretation, which is particularly easy to see in two or three dimensions.

1.2 Solving a System of Linear Equations

Linear Programming. March 14, 2014

Solving Mass Balances using Matrix Algebra

Matrix Algebra in R A Minimal Introduction

Data Analysis with Microsoft Excel 2003

Getting Started with R and RStudio 1

K80TTQ1EP-??,VO.L,XU0H5BY,_71ZVPKOE678_X,N2Y-8HI4VS,,6Z28DDW5N7ADY013

Statistical Models in R

A Program for PCB Estimation with Altium Designer

Data Mining: Algorithms and Applications Matrix Math Review

Lecture 2 Matrix Operations

Introduction to R and UNIX Working with microarray data in a multi-user environment

Basic Excel Handbook

Linear Algebra Review. Vectors

Package hoarder. June 30, 2015

Chapter 7: Additional Topics

Vendor: Crystal Decisions Product: Crystal Reports and Crystal Enterprise

Formulas & Functions in Microsoft Excel

Abstract: We describe the beautiful LU factorization of a square matrix (or how to write Gaussian elimination in terms of matrix multiplication).

SAS/IML - Interactive Matrix Language

Matrix Differentiation

Introduction to Matrices

Vendor: Brio Software Product: Brio Performance Suite

WESTMORELAND COUNTY PUBLIC SCHOOLS Integrated Instructional Pacing Guide and Checklist Computer Math

Here are some examples of combining elements and the operations used:

Commonly Used Excel Functions. Supplement to Excel for Budget Analysts

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Data Analysis Tools. Tools for Summarizing Data

LINEAR INEQUALITIES. less than, < 2x + 5 x 3 less than or equal to, greater than, > 3x 2 x 6 greater than or equal to,

Step by Step Guide to Importing Genetic Data into JMP Genomics

An Introduction to Hill Ciphers Using Linear Algebra

CURVE FITTING LEAST SQUARES APPROXIMATION

Georgia Standards of Excellence Curriculum Map. Mathematics. GSE 8 th Grade

MATHEMATICS FOR ENGINEERS BASIC MATRIX THEORY TUTORIAL 2

Matrices 2. Solving Square Systems of Linear Equations; Inverse Matrices

Exploratory Data Analysis and Plotting

Quantrix & Excel: 3 Key Differences A QUANTRIX WHITE PAPER

Excel 2003 Tutorials - Video File Attributes

Excel supplement: Chapter 7 Matrix and vector algebra

Cluster software and Java TreeView

Cluster Analysis using R

R: A self-learn tutorial

Notes on Determinant

Lecture 1: Systems of Linear Equations

Excel I Sorting and filtering Revised February 2013

Introduction to Matlab

LS.6 Solution Matrices

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Psychology 205: Research Methods in Psychology

Designed by Jason Wagner, Course Web Programmer, Office of e-learning ZPRELIMINARY INFORMATION... 1 LOADING THE INITIAL REPORT... 1 OUR EXAMPLE...

R FOR SAS AND SPSS USERS. Bob Muenchen

Chapter 3. The Normal Distribution

5. Orthogonal matrices

ECDL / ICDL Spreadsheets Syllabus Version 5.0

CS1112 Spring 2014 Project 4. Objectives. 3 Pixelation for Identity Protection. due Thursday, 3/27, at 11pm

with functions, expressions and equations which follow in units 3 and 4.

Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller

A Short Guide to R with RStudio

Transcription:

R Language Fundamentals Data Frames Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007

Tabular Data Frequently, experimental data are held in tables, an Excel worksheet, for example. Naturally, R has very robust structures for holding tabular data, including importing spreadsheets and saving to CSV files.

Outline Objects that Hold Data Matrices and Arrays Data Frames Reading and Writing Tables Selection and Sorting Data Frames

Matrix a square array of numbers The entries in a matrix X are arranged in rows and columns. Think of it as a two-dimensional version of a numeric vector. X is n m if it has n rows and m columns. Create a 3 4 matrix all of whose entries are 0: > X <- matrix(0, nrow = 3, ncol = 4) > X [,1] [,2] [,3] [,4] [1,] 0 0 0 0 [2,] 0 0 0 0 [3,] 0 0 0 0 > dim(x) [1] 3 4 Dimension, dim(x), is an integer vector giving the number of rows and columns.

Matrix Entries, Rows,... A more interesting matrix: > Y <- matrix(1:12, nrow = 3, ncol = 4) > Y [,1] [,2] [,3] [,4] [1,] 1 4 7 10 [2,] 2 5 8 11 [3,] 3 6 9 12 > Y[1, 3] [1] 7 > Y[1, ] [1] 1 4 7 10 > Y[, 2] [1] 4 5 6

Matrix=Vector with Dimension > x <- 1:15 > dim(x) <- c(3, 5) > x [,1] [,2] [,3] [,4] [,5] [1,] 1 4 7 10 13 [2,] 2 5 8 11 14 [3,] 3 6 9 12 15 > class(x) [1] "matrix"

Submatrices > Z <- X[1:2, 3:4] > Z [,1] [,2] [1,] 0 0 [2,] 0 0

Matrix Subsetting There are no (few) surprises. > x <- Y[1, ] > class(x) [1] "integer"

Matrix Assignment There are no (few) surprises. > X[1, 3] <- 1 > X[, 1] <- c(-1, -2, -3) > X [,1] [,2] [,3] [,4] [1,] -1 0 1 0 [2,] -2 0 0 0 [3,] -3 0 0 0 > X[, 4] <- 2 > X [,1] [,2] [,3] [,4] [1,] -1 0 1 2 [2,] -2 0 0 2 [3,] -3 0 0 2

Subsetting with Logicals Review for vectors Remember, if x is a vector and L a logical vector of the same length, x[l] is a (probably shorter) vector comprised of the entries where L is TRUE. If r is a number, x[l] <- r replaces the entry in x by r when L is TRUE and leaves the entry alone otherwise.

Subsetting with Logicals Tricky For a matrix X and logical L of matching dimensions, X[L] is the numeric vector of entries where L is TRUE. > X > 0 [,1] [,2] [,3] [,4] [1,] FALSE FALSE TRUE TRUE [2,] FALSE FALSE FALSE TRUE [3,] FALSE FALSE FALSE TRUE > X[X > 0] [1] 1 2 2 2

Assignment with Logicals Sometimes useful Assignments of the form X[L] <- r work as they do for vectors. > X[1, 1] <- NA > is.na(x) [,1] [,2] [,3] [,4] [1,] TRUE FALSE FALSE FALSE [2,] FALSE FALSE FALSE FALSE [3,] FALSE FALSE FALSE FALSE > X[is.na(X)] <- 0 > X [,1] [,2] [,3] [,4] [1,] 0 0 1 2 [2,] -2 0 0 2 [3,] -3 0 0 2

Matrix and Number Arithmetic For X a matrix and r a number X+r and X*r are the results of adding, resp. multiplying r entrywise to X. What if r is a vector? Experiment and find out. If X is n m and Y is m p, X % *% Y is the matrix product. t(x) is the transpose of X; i.e., the matrix obtained from X by switching rows and columns. diag(x) is the vector of elements on the main diagonal.

Row and Column Stats Frequently we ll want to extract statistics from the rows or columns of a matrix. Let f be a function that produces a number given a vector. If X is a matrix apply(x, 1, f) is the result of applying f to each row of X; apply(x, 2, f) to the columns. The former outputs a vector with one result for each row.

Row and Column Stats > Z <- matrix(rnorm(20), nrow = 4, ncol = 5) > V <- apply(z, 1, mean) > V [1] 0.1603 0.3566 0.1926-1.1196 > W <- apply(z, 2, min) > W [1] 0.3452-0.5668-1.5553-1.8376-2.8867

Names Just as you can name indices in a vector you can (and should!) name columns and rows in a matrix with colnames(x) and rownames(x). These can be used in subsetting just like vectors.

Getting the most varying genes Given a matrix of expression data find the 4000 probes (genes) that are most widely varying across the samples. Restrict the expression matrix to these probes. > load("/users/steve/documents/bio/breast/analysisrecord/da > class(expupps1pos) [1] "matrix" > dim(expupps1pos) [1] 22283 101 > expupps1pos[1:3, 1:4] GSM110625 GSM110629 GSM110631 GSM110635 1007_s_at 10.310 10.197 10.130 10.169 1053_at 5.071 6.227 5.498 5.684 117_at 4.151 4.171 4.105 3.927

Getting the most varying genes For each probe (row in the matrix) we need a measure of how much it is varying. We use inter-quartile range as the measure of change. For a vector x, IQR(x) is the third quartile minus the first quartile. > iqrexp <- apply(expupps1pos, 1, IQR) > names(iqrexp) <- rownames(expupps1pos)

Getting the most varying genes Sorting the measures iqrexp contains the measures of change for each probe. It is a numeric vector. We sort it to find the largest 4000. > siqrexp <- sort(iqrexp, decreasing = TRUE) > top4k <- names(siqrexp)[1:4000] > expupps1top <- expupps1pos[top4k, ] > dim(expupps1top) [1] 4000 101

Outline Objects that Hold Data Matrices and Arrays Data Frames Reading and Writing Tables Selection and Sorting Data Frames

Fundamental Object for Experimental Data A data.frame object in R has similar dimensional properties to a matrix but it may contain categorical data, as well as numeric. The standard is to put data for one sample across a row and covariates as columns. On one level, as the notation will reflect, a data frame is a list. Each component corresponds to a variable; i.e., the vector of values of a given variable for each sample. A data frame is like a list with components as columns of a table.

Data Frame Restrictions When can a list be made into a data.frame? Components must be vectors (numeric, character, logical) or factors. All vectors and factors must have the same lengths. Matrices and even other data frames can be combined with vectors to form a data frame if the dimensions match up.

Creating Data Frames Explicitly like a list > measrs <- data.frame(gender = c("m", "M", + "F"), ht = c(172, 186.5, 165), wt = c(91, + 99, 74)) > measrs gender ht wt 1 M 172.0 91 2 M 186.5 99 3 F 165.0 74 Entries in a data.frame are indexed like a matrix: > measrs[1, 2] [1] 172

Data Frame Attributes Both List and Matrix > names(measrs) [1] "gender" "ht" "wt" > rownames(measrs) <- c("s1", "S2", "S3") > measrs$ht [1] 172.0 186.5 165.0

Compenents as Vectors The components of a data frame can be extracted as a vector as in a list: > height <- measrs$ht > height [1] 172.0 186.5 165.0 > names(height) <- rownames(measrs) Warning: Character vectors in a data frame are always stored as a factor. It s assumed that s what you should do. > class(measrs$gend) [1] "factor"

Extracting ALL Compenents All components in a data frame can be extracted as vectors with the corresponding name: > attach(measrs) > wt [1] 91 99 74 > detach(measrs)

Expanding Data Frames Components can be added to a data frame in the natural way. > measrs$age <- c(28, 55, 43) > measrs gender ht wt age S1 M 172.0 91 28 S2 M 186.5 99 55 S3 F 165.0 74 43

Expanding Data Frames Row Bind, Column Bind If you expand the experiment to add data, use row binding to expand. > m2 <- data.frame(gender = c("m", "F"), + ht = c(170, 166), wt = c(68, 72), + age = c(38, 22)) > rownames(m2) <- c("s4", "S5") > measrs2 <- rbind(measrs, m2) If other data are kept on the same samples in another data frame it can be combined with the original using cbind.

Outline Objects that Hold Data Matrices and Arrays Data Frames Reading and Writing Tables Selection and Sorting Data Frames

Reading Tables The Working Directory R can read in files on your machine and create data files and graphics. Paths to these files are computed relative to the working directory. Paths are specified in the format appropriate for the machine. Typically, separate analyses are stored in different directories.

Can Move Between R and Excel Reading.CSV Excel allows you to save files as comma separated values. All formatting is lost but the information content is there. There is a function in R that reads the.csv file and produces a table of data. > setwd("lect2workdir") > clinupps1 <- read.csv(file = "Clinical_Upps.csv") > setwd("../") > class(clinupps1) [1] "data.frame"

Can Move Between R and Excel Writing.CSV > write.csv(measrs2, file = "measrs2.csv") These are special cases of the more general functions read.table, write.table. There are numerous options such as Is the first line header information? See the help entries.

Outline Objects that Hold Data Matrices and Arrays Data Frames Reading and Writing Tables Selection and Sorting Data Frames

Select Rows Based Variable Values Commonly, we ll want to select those rows in a data.frame in which one of the variables has specific values. The entries in measrs2 with height 170 are found as follows. > talls <- measrs2[measrs2$ht >= 170, ] > talls gender ht wt age S1 M 172.0 91 28 S2 M 186.5 99 55 S4 M 170.0 68 38

Select Rows Based Variable Values Combining variables Attaching the components of the data frame as variables allows for simpler formulas. > attach(measrs2) > tallnthin <- measrs2[ht >= 170 & wt <= + 75, ] > tallnthin gender ht wt age S4 M 170 68 38

Sort a Data Frame by Selected Column Often data are better viewed when sorted. The function order sorts a column and gives output that can sort the rows of a data.frame. The following sorts measrs2 by age. > measrsbyage <- measrs2[order(measrs2[, + "age"]), ] > measrsbyage gender ht wt age S5 F 166.0 72 22 S1 M 172.0 91 28 S4 M 170.0 68 38 S3 F 165.0 74 43 S2 M 186.5 99 55

Sort a Data Frame by Selected Column Reverse order rev reverses the order from increasing to decreasing. > measrsbyht <- measrs2[rev(order(measrs2[, + "ht"])), ] > measrsbyht gender ht wt age S2 M 186.5 99 55 S1 M 172.0 91 28 S4 M 170.0 68 38 S5 F 166.0 72 22 S3 F 165.0 74 43