Regression and Programming in R Anja Bråthen Kristoffersen Biomedical Research Group
R Reference Card http://cran.r-project.org/doc/contrib/short-refcard.pdf
Simple linear regression
Describes the relationship between two variables x and y:
$$y = \alpha + \beta x + \epsilon$$
The numbers α and β are called parameters, and ϵ is the error term.
2014.01.15
Example data
Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
    head(faithful)
      eruptions waiting
    1     3.600      79
    2     1.800      54
    3     3.333      74
    4     2.283      62
    5     4.533      85
    6     2.883      55
Model summary output for the linear regression (figure), annotated with the estimated parameters, the p-value, and the coefficient of determination.
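The later slides use an object called eruption.lm without showing how it was created. A minimal sketch, assuming it is an ordinary least-squares fit of eruption duration on waiting time (that formula is an assumption, not taken from the slides):

```r
# Assumption: eruption.lm is a simple linear regression of
# eruption duration on waiting time, fitted to the built-in data set.
eruption.lm <- lm(eruptions ~ waiting, data = faithful)

# Estimated parameters: intercept (alpha) and slope (beta)
coef(eruption.lm)

# Coefficient table with p-values, plus the coefficient of determination
summary(eruption.lm)
```

The annotated quantities can then be read off the printed summary.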
Coefficient of determination
The quotient of the variances of the fitted values and the observed values of the dependent variable:
$$r^2 = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$$
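As a check on the formula, the quotient can be computed by hand and compared with the value reported by summary(). This is a sketch on the faithful data; the object names fit, r2.manual and r2.summary are illustrative only.

```r
# Fit the simple linear regression on the built-in faithful data
fit <- lm(eruptions ~ waiting, data = faithful)

y     <- faithful$eruptions   # observed values
y.hat <- fitted(fit)          # fitted values
y.bar <- mean(y)              # mean of the observed values

# Sum of squares of fitted values over sum of squares of observed values
r2.manual  <- sum((y.hat - y.bar)^2) / sum((y - y.bar)^2)

# Value reported by R
r2.summary <- summary(fit)$r.squared

all.equal(r2.manual, r2.summary)
```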
Prediction
Develop a 95% confidence interval of the mean eruption duration for a waiting time of 80 minutes.
    newdata <- data.frame(waiting=80)
    predict(eruption.lm, newdata, interval="confidence")
         fit    lwr    upr
    1 4.1762 4.1048 4.2476
Residuals
The difference between the observed data of the dependent variable y and the fitted values ŷ:
$$\text{Residual} = y - \hat{y}$$
    eruption.res <- resid(eruption.lm)
    plot(faithful$waiting, eruption.res, ylab="Residuals", xlab="Waiting Time", main="Old Faithful Eruptions")
    abline(0, 0, col = "red", lwd = 2)
Standardized residual
The residual divided by its standard deviation:
$$\text{Standardized residual}_i = \frac{\text{residual}_i}{\text{standard deviation of residual}_i}$$
    eruption.stdres <- rstandard(eruption.lm)
    plot(faithful$waiting, eruption.stdres, ylab="Standardized Residuals", xlab="Waiting Time", main="Old Faithful Eruptions")
    abline(0, 0, col = "red", lwd = 2)
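To make the definition concrete, the standardized residuals can be rebuilt by hand. In a linear model the standard deviation of residual i is estimated as s * sqrt(1 - h_i), where s is the residual standard error and h_i the leverage of observation i. A sketch on the faithful data, with illustrative object names:

```r
fit <- lm(eruptions ~ waiting, data = faithful)

res <- resid(fit)            # raw residuals y - y.hat
s   <- summary(fit)$sigma    # residual standard error
h   <- hatvalues(fit)        # leverage of each observation

# Residual divided by its estimated standard deviation
stdres.manual <- res / (s * sqrt(1 - h))

# Should agree with the built-in function
all.equal(stdres.manual, rstandard(fit))
```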
Normal Probability Plot of Residuals
    qqnorm(eruption.stdres, ylab="Standardized Residuals", xlab="Normal Scores", main="Old Faithful Eruptions")
    qqline(eruption.stdres, col = "red", lwd=3)
Generalized additive model
Replaces the linear relationship between the response and the explanatory variable, as used in linear regression, with a non-linear relationship, here a smooth function (a spline).
GAM
    install.packages("mgcv")
    library(mgcv)
    This is mgcv 1.5-5. For overview type `help("mgcv-package")'.
    eruption.gam <- gam(eruptions ~ 1+s(waiting), data=faithful)
    plot(eruption.gam)
Model summary output for the GAM (figure), annotated with the p-value and the coefficient of determination.
    eruption.res.gam <- resid(eruption.gam)
    plot(faithful$waiting, eruption.res.gam, ylab="Residuals", xlab="Waiting Time", main="Old Faithful Eruptions")
    abline(0, 0, col="red", lwd=2)
    eruption.res.gam <- resid(eruption.gam)
    qqnorm(eruption.res.gam)
    qqline(eruption.res.gam, col="red", lwd=3)
Multiple regression
Describes a dependent variable y (response) by independent variables x_1, x_2, ..., x_p (p > 1). It is expressed by the equation
$$y = \alpha + \sum_{k=1}^{p} \beta_k x_k + \epsilon$$
where the numbers α and β_k (k = 1, 2, ..., p) are the parameters, and ϵ is the error term.
Multiple regression: explore your data
Explore your data before starting the analysis.
Are the explanatory variables correlated?
Use plots (trellis graphics, boxplot).
Use correlation (Pearson, Spearman).
DLBCL patient data
Response: Germinal.center.B.cell.signature
Explanatory variables:
    Lymph.node.signature
    Proliferation.signature
    BMP6
    MHC.class.II.signature
    pairs(dat[,8:11])
    cor(dat[,8:11])
Model summary output for the multiple regression (figure), annotated with the p-values, the non-significant p-values, and the adjusted coefficient of determination.
Adjusted Coefficient of Determination
$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$
where n is the number of observations and p is the number of parameters used in the model.
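The formula can be checked against the value R reports. This sketch uses the faithful data and assumes that p counts the slope coefficients β_k only (not the intercept α), which is what the n - p - 1 denominator implies; the object names are illustrative.

```r
fit <- lm(eruptions ~ waiting, data = faithful)

n  <- nrow(faithful)            # number of observations
p  <- 1                         # one explanatory variable (assumption: intercept not counted)
r2 <- summary(fit)$r.squared    # ordinary coefficient of determination

# Adjusted coefficient of determination from the formula
r2.adj <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

all.equal(r2.adj, summary(fit)$adj.r.squared)
```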
Comparing models
As long as the analyses are done on the same response data, you can compare your models by information criteria:
AIC (Akaike's An Information Criterion): -2*log-likelihood + 2*npar
BIC (Schwarz's Bayesian criterion): -2*log-likelihood + log(n)*npar
Goal: as small an AIC or BIC as possible, i.e. explain the most with the fewest parameters.
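Both criteria can be computed directly from the log-likelihood and compared with AIC() and BIC(). A sketch on the faithful data with illustrative object names:

```r
fit <- lm(eruptions ~ waiting, data = faithful)

ll   <- logLik(fit)
npar <- attr(ll, "df")   # number of estimated parameters (alpha, beta, residual variance)
n    <- nobs(fit)        # number of observations

aic.manual <- -2 * as.numeric(ll) + 2 * npar
bic.manual <- -2 * as.numeric(ll) + log(n) * npar

# Compare with the built-in functions
all.equal(aic.manual, AIC(fit))
all.equal(bic.manual, BIC(fit))
```

Note that npar for a linear model includes the residual variance, so it is one larger than the number of regression coefficients.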
All p-values are significant.
The AIC value is lower, so choose this model.
The adjusted coefficient of determination has hardly changed.
    fit2.res <- resid(fit2)
    fit2.fitted <- fitted(fit2)
    plot(fit2.fitted, fit2.res, col = "darkgreen", pch="*")   # plot(), not pairs(): two vectors give a scatter plot
Logistic regression
We use the logistic regression equation to predict the probability of a dependent variable that takes on only the values 0 and 1. Suppose x_1, x_2, ..., x_p (p > 1) are the independent variables, α and β_k (k = 1, 2, ..., p) are the parameters, and E(y) is the expected value of the dependent variable y:
$$E(y) = \frac{1}{1 + e^{-\left(\alpha + \sum_{k} \beta_k x_k\right)}}$$
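A small numeric illustration of the equation. The parameter and covariate values below are hypothetical, chosen so that the linear predictor is exactly 0; plogis() is base R's logistic distribution function and gives the same result.

```r
# Hypothetical parameters and covariate values (not from the DLBCL data)
alpha <- -1
beta  <- c(0.5, 2)
x     <- c(1, 0.25)

# Linear predictor: alpha + sum_k beta_k * x_k = -1 + 0.5 + 0.5 = 0
eta <- alpha + sum(beta * x)

# Expected value of y from the logistic regression equation
p.manual  <- 1 / (1 + exp(-eta))

# Same calculation with the built-in logistic function
p.builtin <- plogis(eta)

c(manual = p.manual, builtin = p.builtin)
```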
DLBCL patient data
Response: alive or dead
Explanatory variables:
    Subgroup
    IPI.Group
    Germinal.center.B.cell.signature
    Lymph.node.signature
    Proliferation.signature
    BMP6
    MHC.class.II.signature
    DLBCL.glm <- glm(status.at.follow.up ~ Subgroup + IPI.Group +
                         Germinal.center.B.cell.signature + Lymph.node.signature +
                         Proliferation.signature + BMP6 + MHC.class.II.signature,
                     family = "binomial", data = dat)
The argument family = "binomial" is where logistic regression is chosen.
    DLBCL2.glm <- glm(status.at.follow.up ~ IPI.Group, family = "binomial", data = dat)
    summary(DLBCL2.glm)
IPI.Group levels:
       High     Low  Medium missing    NA's
         32      82     108       1      17
The estimate for group Low is negative, hence patients in group Low have a lower probability of dying than those in group High.
The survival of patients in IPI.Group Low is significantly different from that of patients in group High.
Backward and forward selection of explanatory variables
Backward: start with all explanatory variables, exclude the one that is least significant, and compare AIC values.
Forward: start with one regression for each explanatory variable, include more and more explanatory variables that are significant in the model, and compare AIC values.
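R's step() function automates a closely related search, adding or dropping one term at a time based on AIC rather than on p-values. The sketch below is not the procedure used on the DLBCL data: it uses the built-in mtcars data set and an arbitrary set of explanatory variables purely for illustration.

```r
# Illustrative full model on a built-in data set (variables chosen arbitrarily)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)   # intercept-only model

# Backward: start with all explanatory variables and drop terms while AIC improves
backward <- step(full, direction = "backward", trace = 0)

# Forward: start with no explanatory variables and add terms while AIC improves
forward <- step(null, scope = formula(full), direction = "forward", trace = 0)

formula(backward)
formula(forward)
c(full = AIC(full), backward = AIC(backward), forward = AIC(forward))
```

The two directions do not have to end at the same model, which is why comparing the final AIC values is useful.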
Programming in R
Function
    panel.hist <- function(x, ...){     # ... absorbs col, cex, pch passed on by pairs()
        usrx <- par("usr")
        on.exit(par(usr = usrx))         # restore the plot coordinates on exit
        par(usr = c(usrx[1:2], 0, 1.5))  # indicates the position in the plot
        hi <- hist(x, plot = FALSE)      # calculate histogram without plotting it
        breaks <- hi$breaks              # define breaks used in hi
        nb <- length(breaks)             # count the number of breaks
        y <- hi$counts                   # find the counts in each interval
        y <- y/max(y)                    # scale y for plotting
        rect(breaks[-nb], 0, breaks[-1], y)  # plots rectangles in existing plot
    }
    pairs(dat[,7:11], col = dat[,5], cex = 0.5, pch = 24, diag.panel = panel.hist, upper.panel = NULL)
apply()
Use apply() when you want the same thing done on every row or column of a matrix.
apply(d, 1, functiona) applies functiona to every row of dataset d.
apply(d, 2, functiona) applies functiona to every column of dataset d.
Example: apply()
    d <- matrix(runif(90), ncol = 10, nrow = 9)
    head(d)
    par(mfrow = c(3,3))
    apply(d, 1, hist)
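A variant of the example that returns numbers instead of plots can make the row/column distinction easier to see. The seed and object names are illustrative.

```r
set.seed(1)                                   # only to make the example reproducible
d <- matrix(runif(90), ncol = 10, nrow = 9)   # 9 rows, 10 columns

row.means <- apply(d, 1, mean)   # one value per row: length 9
col.max   <- apply(d, 2, max)    # one value per column: length 10

length(row.means)
length(col.max)

# For means, the specialised function gives the same answer
all.equal(row.means, rowMeans(d))
```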
for loop
    for (name in expr_1) expr_2
name is the loop variable, here i.
expr_1 should be 1:antall.
expr_2 should be sum(dbinom(1:6, 8, p[i])), but the result has to be saved somewhere:
    beta8[i] <- sum(dbinom(1:6, 8, p[i]))
so beta8 has to be created before the loop.
for loop
    p <- seq(0.6, 1, 0.01)
    antall <- length(p)
    beta8 <- rep(NA, antall)   # allocate space for beta8
    for(i in 1:antall){
        beta8[i] <- sum(dbinom(1:6, 8, p[i]))
    }
    power8 <- 1 - beta8
    plot(p, power8, type = "l")
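The same calculation can also be written without an explicit loop. This sketch repeats the loop for comparison and then uses sapply(), which applies a function to each element of p and collects the results; the names beta8.loop, beta8.vec and prob are illustrative.

```r
p <- seq(0.6, 1, 0.01)

# Loop version: allocate the result vector first, then fill it
beta8.loop <- rep(NA, length(p))
for (i in seq_along(p)) {
  beta8.loop[i] <- sum(dbinom(1:6, 8, p[i]))
}

# sapply version: no pre-allocation needed
beta8.vec <- sapply(p, function(prob) sum(dbinom(1:6, 8, prob)))

all.equal(beta8.loop, beta8.vec)
```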
if statement
    if (expr_1) expr_2 else expr_3
expr_1 is a statement that is either true or false.
expr_2 is performed if expr_1 is true.
expr_3 is optional; it is performed if expr_1 is false and the else statement is included.
Example: if()
    x <- runif(1,0,1)
    y <- rnorm(1,0,1)
    if(x > y){
        print(paste("x =", round(x,3), "is greater than y =", round(y,3), sep = " "))
    } else {
        print(paste("x =", round(x,3), "is less than y =", round(y,3), sep = " "))
    }
    [1] "x = 0.783 is greater than y = 0.536"
When finished
You could save your workspace in R, and you could then save the workspace in different project folders. I do not do that!
You could just save your script and run it again if you want to work further on it.
You could write partial results (e.g. a matrix) that you want to work further on to a .txt file:
    write.table(objectname, path, sep = "\t")
objectname is the name of your object in R.
path is the path to where you will save the object, including the file name.
sep = "\t" ensures that your .txt file is tab-separated, which makes it easier to open in Excel.
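A minimal round trip, writing a small matrix to a tab-separated file and reading it back. The matrix, the file name partial_results.txt and the use of tempdir() are illustrative assumptions; row.names = FALSE and quote = FALSE are optional choices that keep the file tidy for Excel.

```r
# Illustrative partial result
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

# Hypothetical location: a file in R's temporary directory
path <- file.path(tempdir(), "partial_results.txt")

# Write as tab-separated text, without row names or quotes
write.table(m, path, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back in a later session to continue working
m2 <- as.matrix(read.table(path, header = TRUE, sep = "\t"))
m2
```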