Regression and Programming in R. Anja Bråthen Kristoffersen Biomedical Research Group


R Reference Card http://cran.r-project.org/doc/contrib/short-refcard.pdf

Simple linear regression
Describes the relationship between two variables x and y:
$y = \alpha + \beta x + \epsilon$
The numbers α and β are called parameters, and ϵ is the error term.
2014.01.15

Example data
Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
    head(faithful)
      eruptions waiting
    1     3.600      79
    2     1.800      54
    3     3.333      74
    4     2.283      62
    5     4.533      85
    6     2.883      55

[Figure: output for the fitted linear model, annotated with the estimated parameters, the p-value and the coefficient of determination]
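The object eruption.lm is used on the following slides, but the call that creates it is not part of the transcript. A minimal sketch, assuming the model is a simple linear regression of eruption duration on waiting time (an assumption, although it is consistent with the prediction reported for a waiting time of 80 minutes):

```r
# Assumption: eruption.lm regresses eruption duration on waiting time;
# the slides do not show this call.
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

eruption.summary <- summary(eruption.lm)
eruption.summary$coefficients   # estimates, standard errors, t values, p-values
eruption.summary$r.squared      # coefficient of determination
```

The coefficient table holds the estimated parameters and their p-values, and r.squared is the coefficient of determination.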

Coefficient of determination
The quotient of the variances of the fitted values and the observed values of the dependent variable.
$r^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$
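The quotient can be checked numerically against the value reported by summary(). This is a sketch that assumes eruption.lm is the regression of eruptions on waiting, which the transcript does not state explicitly; the names y, y.hat and r2.manual are introduced here for illustration.

```r
# Assumption: same model as on the other slides.
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

y     <- faithful$eruptions      # observed values
y.hat <- fitted(eruption.lm)     # fitted values

# Variation of the fitted values divided by variation of the observed values
r2.manual <- sum((y.hat - mean(y))^2) / sum((y - mean(y))^2)

r2.manual
summary(eruption.lm)$r.squared   # should agree with r2.manual
```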

Prediction
Develop a 95% confidence interval of the mean eruption duration for a waiting time of 80 minutes.
    newdata <- data.frame(waiting=80)
    predict(eruption.lm, newdata, interval="confidence")
         fit    lwr    upr
    1 4.1762 4.1048 4.2476

Residuals
The difference between the observed data of the dependent variable y and the fitted values ŷ.
$\text{residual} = y - \hat{y}$
    eruption.res <- resid(eruption.lm)
    plot(faithful$waiting, eruption.res, ylab="Residuals", xlab="Waiting Time", main="Old Faithful Eruptions")
    abline(0, 0, col = "red", lwd = 2)


Standardized residual
The residual divided by its standard deviation.
$\text{standardized residual}_i = \frac{\text{residual}_i}{\text{standard deviation of residual}_i}$
    eruption.stdres = rstandard(eruption.lm)
    plot(faithful$waiting, eruption.stdres, ylab="Standardized Residuals", xlab="Waiting Time", main="Old Faithful Eruptions")
    abline(0, 0, col = "red", lwd = 2)
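For a linear model, the standard deviation of residual i can be written as s·sqrt(1 − h_i), where s is the residual standard error and h_i the leverage of observation i. The sketch below reproduces rstandard() from these pieces; it assumes eruption.lm is the regression of eruptions on waiting, and the names s, h and stdres.manual are introduced only for this example.

```r
# Assumption: same model as on the other slides.
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

s <- summary(eruption.lm)$sigma   # residual standard error
h <- hatvalues(eruption.lm)       # leverage of each observation

# Residual divided by its estimated standard deviation
stdres.manual <- resid(eruption.lm) / (s * sqrt(1 - h))

all.equal(stdres.manual, rstandard(eruption.lm))
```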


Normal Probability Plot of Residuals
    qqnorm(eruption.stdres, ylab="Standardized Residuals", xlab="Normal Scores", main="Old Faithful Eruptions")
    qqline(eruption.stdres, col = "red", lwd=3)

Generalized additive model
Replaces the linear relationship between the response and an explanatory variable, as in linear regression, with a non-linear relationship, for example a spline.

GAM
    install.packages("mgcv")
    library(mgcv)
    This is mgcv 1.5-5. For overview type `help("mgcv-package")'.
    eruption.gam <- gam(eruptions ~ 1+s(waiting), data=faithful)
    plot(eruption.gam)

[Figure: output of plot(eruption.gam)]

[Figure: output for the GAM, annotated with the p-value and the coefficient of determination]
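The p-value and the coefficient of determination for the GAM can also be read from the summary object. A short sketch, assuming mgcv is installed; the name gam.summary is introduced here, and the comparison with the linear model assumes that model is lm(eruptions ~ waiting).

```r
library(mgcv)

eruption.gam <- gam(eruptions ~ 1 + s(waiting), data = faithful)
gam.summary  <- summary(eruption.gam)

gam.summary$s.table   # effective degrees of freedom and approximate p-value of s(waiting)
gam.summary$r.sq      # adjusted coefficient of determination

# Assumption: the linear model from the earlier slides, refitted for comparison
eruption.lm <- lm(eruptions ~ waiting, data = faithful)
AIC(eruption.lm, eruption.gam)
```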

    eruption.res.gam <- resid(eruption.gam)
    plot(faithful$waiting, eruption.res.gam, ylab="Residuals", xlab="Waiting Time", main="Old Faithful Eruptions")
    abline(0, 0, col="red", lwd=2)

    eruption.res.gam <- resid(eruption.gam)
    qqnorm(eruption.res.gam)
    qqline(eruption.res.gam, col="red", lwd=3)

Multiple regression
Describes a dependent variable y (response) by independent variables x_1, x_2, ..., x_p (p > 1) and is expressed by the equation
$y = \alpha + \sum_{k=1}^{p} \beta_k x_k + \epsilon$
where the numbers α and β_k (k = 1, 2, ..., p) are the parameters, and ϵ is the error term.

Multiple regression: explore your data
Explore your data before starting the analysis.
Are the explanatory variables correlated?
Use plots (trellis graphics, boxplot).
Use correlation (Pearson, Spearman).

DLBCL patient data
Response: Germinal.center.B.cell.signature
Explanatory variables:
Lymph.node.signature
Proliferation.signature
BMP6
MHC.class.II.signature

    pairs(dat[,8:11])

    cor(dat[,8:11])

[Figure: output for the multiple regression model, annotated with the p-values (some of them not significant) and the adjusted coefficient of determination]

Adjusted Coefficient of Determination
$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$
where n is the number of observations and p is the number of parameters used in the model.
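The adjustment can be verified by hand. The DLBCL data are not available here, so this sketch uses the built-in mtcars data as a stand-in; the model car.lm and its variables are chosen only for illustration.

```r
# Stand-in data (assumption): mtcars instead of the DLBCL data set
car.lm <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nobs(car.lm)                    # number of observations
p  <- length(coef(car.lm)) - 1        # number of explanatory variables
R2 <- summary(car.lm)$r.squared

R2.adj.manual <- 1 - (1 - R2) * (n - 1) / (n - p - 1)

R2.adj.manual
summary(car.lm)$adj.r.squared         # should agree with R2.adj.manual
```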

Comparing models
As long as the analyses are done on the same response, you can compare your models by information criteria:
AIC (Akaike's An Information Criterion): -2*log-likelihood + 2*npar
BIC (Schwarz's Bayesian criterion): -2*log-likelihood + log(n)*npar
Goal: as small an AIC or BIC as possible, i.e. explain the most with the fewest parameters.
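Both criteria can be computed directly from the log-likelihood and compared with the built-in AIC() and BIC(). The sketch uses mtcars as a stand-in for the DLBCL data, and the model names fit.full and fit.reduced are hypothetical.

```r
# Stand-in data (assumption): mtcars instead of the DLBCL data set
fit.full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)
fit.reduced <- lm(mpg ~ wt + hp, data = mtcars)

ll   <- logLik(fit.reduced)
npar <- attr(ll, "df")       # regression coefficients plus the residual variance
n    <- nobs(fit.reduced)

aic.manual <- -2 * as.numeric(ll) + 2 * npar
bic.manual <- -2 * as.numeric(ll) + log(n) * npar

AIC(fit.full, fit.reduced)   # the smaller value indicates the preferred model
BIC(fit.full, fit.reduced)
```

Note that npar counts the residual variance as a parameter for lm fits, so it is one larger than the number of regression coefficients.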

All p-values are significant.
The AIC value is lower, so choose this model.
The adjusted coefficient of determination has hardly changed.

    fit2.res <- resid(fit2)
    fit2.fitted <- fitted(fit2)
    # residuals against fitted values; plot() is used because pairs() expects a matrix or data frame
    plot(fit2.fitted, fit2.res, col = "darkgreen", pch="*")

Logistic regression
We use the logistic regression equation to predict the probability of a dependent variable that takes on only the values 0 and 1.
Suppose x_1, x_2, ..., x_p (p > 1) are the independent variables, α and β_k (k = 1, 2, ..., p) are the parameters, and E(y) is the expected value of the dependent variable y. Then
$E(y) = \frac{1}{1 + e^{-(\alpha + \sum_{k=1}^{p} \beta_k x_k)}}$
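The equation can be evaluated by hand from the estimated parameters and compared with the fitted probabilities returned by glm(). This sketch uses mtcars as a stand-in data set with the 0/1 variable am as response; car.glm, eta and p.manual are names introduced for the example.

```r
# Stand-in data (assumption): mtcars, response am takes only the values 0 and 1
car.glm <- glm(am ~ wt + hp, family = "binomial", data = mtcars)

# Linear predictor alpha + sum_k beta_k * x_k for every observation
eta <- drop(model.matrix(car.glm) %*% coef(car.glm))

# Logistic regression equation: E(y) = 1 / (1 + exp(-eta))
p.manual <- 1 / (1 + exp(-eta))

all.equal(p.manual, fitted(car.glm))
```

The same probabilities are returned by predict(car.glm, type = "response").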

DLBCL patient data
Response: alive or dead
Explanatory variables:
Subgroup
IPI.Group
Germinal.center.B.cell.signature
Lymph.node.signature
Proliferation.signature
BMP6
MHC.class.II.signature

    DLBCL.glm <- glm(status.at.follow.up ~ Subgroup + IPI.Group + Germinal.center.B.cell.signature +
                     Lymph.node.signature + Proliferation.signature + BMP6 + MHC.class.II.signature,
                     family = "binomial", data = dat)
Here family = "binomial" is what selects logistic regression.

    DLBCL2.glm <- glm(status.at.follow.up ~ IPI.Group, family = "binomial", data = dat)
    summary(DLBCL2.glm)

       High     Low  Medium missing    NA's
         32      82     108       1      17

The estimate is negative, hence patients in group Low have a lower probability of dying than those in group High.
The survival of patients in IPI.Group Low is significantly different from that of patients in group High.

Backward and forward inclusion of explanatory variables
Backward: Start with all explanatory variables, exclude the one that is least significant, and compare AIC values.
Forward: Start with one regression for each explanatory variable, include more and more explanatory variables that are significant in the model, and compare AIC values.
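R's step() function automates a closely related procedure: it adds or drops one term at a time and keeps the change that lowers AIC most, rather than looking at p-values. A minimal sketch on mtcars as stand-in data; the model names fit.all, fit.null, fit.backward and fit.forward are hypothetical.

```r
# Stand-in data (assumption): mtcars instead of the DLBCL data set
fit.all  <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)   # all explanatory variables
fit.null <- lm(mpg ~ 1, data = mtcars)                       # intercept only

# Backward: start with everything and drop terms while AIC decreases
fit.backward <- step(fit.all, direction = "backward", trace = 0)

# Forward: start with nothing and add terms while AIC decreases
fit.forward <- step(fit.null,
                    scope = list(lower = ~ 1, upper = ~ wt + hp + qsec + drat),
                    direction = "forward", trace = 0)

formula(fit.backward)
formula(fit.forward)
AIC(fit.all, fit.backward, fit.forward)
```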

Programming in R

Function
    panel.hist <- function(x, ...){         # ... absorbs col, cex and pch passed on by pairs()
      usrx <- par("usr")
      on.exit(par(usr = usrx))              # restore the original plot coordinates on exit
      par(usr = c(usrx[1:2], 0, 1.5))       # rescale the y axis of the panel to 0-1.5
      hi <- hist(x, plot = FALSE)           # calculate histogram without plotting it
      breaks <- hi$breaks                   # breaks used in hi
      nb <- length(breaks)                  # count the number of breaks
      y <- hi$counts                        # find the counts in each interval
      y <- y/max(y)                         # scale y for plotting
      rect(breaks[-nb], 0, breaks[-1], y)   # plot rectangles in the existing plot
    }
    pairs(dat[,7:11], col = dat[,5], cex = 0.5, pch = 24, diag.panel = panel.hist, upper.panel = NULL)


apply()
Use apply when you want the same thing done on every row or column of a matrix.
apply(d, 1, functiona) will use functiona on every row of matrix d.
apply(d, 2, functiona) will use functiona on every column of matrix d.

Example: apply()
    d <- matrix(runif(90), ncol = 10, nrow = 9)
    head(d)
    par(mfrow = c(3,3))
    apply(d, 1, hist)
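The same pattern works with any function that takes a vector, including one written on the spot. A small sketch with summary functions instead of plots; the object names are introduced here, and set.seed() is added only to make the random numbers reproducible.

```r
set.seed(1)                                   # reproducible random numbers
d <- matrix(runif(90), ncol = 10, nrow = 9)

row.means <- apply(d, 1, mean)                # one value per row (9 values)
col.max   <- apply(d, 2, max)                 # one value per column (10 values)

# An anonymous function can be supplied directly
row.range <- apply(d, 1, function(x) max(x) - min(x))

all.equal(row.means, rowMeans(d))             # rowMeans() is a faster shortcut
```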

for loop
    for (name in expr_1) expr_2
name is the loop variable, here i.
expr_1 should be 1:antall.
expr_2 should be sum(dbinom(1:6, 8, p[i])), but you have to save the result somewhere:
    beta8[i] <- sum(dbinom(1:6, 8, p[i]))
so you have to create beta8 before the loop.

for loop
    p <- seq(0.6, 1, 0.01)
    antall <- length(p)
    beta8 <- rep(NA, antall) # allocate space for beta8
    for(i in 1:antall){
      beta8[i] <- sum(dbinom(1:6, 8, p[i]))
    }
    power8 <- 1 - beta8
    plot(p, power8, type = "l")
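The loop can also be written with sapply(), which applies a function to every element of p and collects the results, so no space has to be allocated beforehand. This is offered as an alternative sketch, not as the approach used on the slides; beta8.loop, beta8.sapply and prob are names introduced for the comparison.

```r
p <- seq(0.6, 1, 0.01)

# Loop version, as on the slide
beta8.loop <- rep(NA, length(p))
for (i in seq_along(p)) {
  beta8.loop[i] <- sum(dbinom(1:6, 8, p[i]))
}

# sapply version: one function call per element of p
beta8.sapply <- sapply(p, function(prob) sum(dbinom(1:6, 8, prob)))

all.equal(beta8.loop, beta8.sapply)
```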


if statement
    if (expr_1) expr_2 else expr_3
expr_1 is a statement that is either true or false.
expr_2 is performed if expr_1 is true.
expr_3 is optional, but is performed if expr_1 is false and the else statement is included.

Example: if()
    x <- runif(1,0,1)
    y <- rnorm(1,0,1)
    if(x > y){
      print(paste("x =", round(x,3), "is greater than y =", round(y,3), sep = " "))
    } else {
      print(paste("x =", round(x,3), "is less than y =", round(y,3), sep = " "))
    }
    [1] "x = 0.783 is greater than y = 0.536"

When finished
You could save your workspace in R, keeping a separate workspace in each project folder. I do not do that!
You could just save your script and run it again when you want to work further on it.
You could write the partial results (a matrix) that you want to keep working on to a .txt file:
    write.table(objectname, path, sep = "\t")
objectname is the name of your object in R.
path is the path to where you will save the object, including the file name.
sep = "\t" ensures that your .txt file is tab-separated, which makes it easier to open in Excel.
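A round trip shows that the saved file can be read back unchanged. The sketch writes to a temporary directory; the object results, the file name results.txt and the extra arguments row.names and quote are choices made for this example, not taken from the slides.

```r
# Hypothetical partial result to be saved
results <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

# Assumption: a temporary folder is used here; replace it with your project folder
path <- file.path(tempdir(), "results.txt")

# Tab-separated text file without row names or quotes
write.table(results, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the file back in a later session
results.back <- read.table(path, header = TRUE, sep = "\t")
results.back
```

Note that read.table() returns a data frame; use as.matrix() if a matrix is needed again.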