Graphics and Data Visualization in R



Similar documents
Getting started with qplot

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

Data Visualization in R

Viewing Ecological data using R graphics

R Graphics II: Graphics for Exploratory Data Analysis

Getting Started with R and RStudio 1

5 Correlation and Data Exploration

Graphics in R. Biostatistics 615/815

Scientific data visualization

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Data Visualization with R Language

Each function call carries out a single task associated with drawing the graph.

How To Cluster

Graphical Representation of Multivariate Data

Cluster Analysis using R

Hierarchical Clustering Analysis

Chapter 4 Creating Charts and Graphs

Tutorial 2: Descriptive Statistics and Exploratory Data Analysis

Big Data and Scripting. Plotting in R

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

Lecture 2: Exploratory Data Analysis with R

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Visualization Quick Guide

Data Exploration Data Visualization

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

Plotting: Customizing the Graph

Data Mining and Visualization

Formulas, Functions and Charts

Diagrams and Graphs of Statistical Data

Introduction Course in SPSS - Evening 1

Prof. Nicolai Meinshausen Regression FS R Exercises

MARS STUDENT IMAGING PROJECT

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Information Visualization Multivariate Data Visualization Krešimir Matković

Package RIGHT. March 30, 2015

Journal of Statistical Software

All Visualizations Documentation

Microsoft Excel 2010 Charts and Graphs

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Figure 1. An embedded chart on a worksheet.

Linear Discriminant Analysis

Analysis of System Performance IN2072 Chapter M Matlab Tutorial

Exploratory Spatial Data Analysis

Scientific Graphing in Excel 2010

Tutorial for proteome data analysis using the Perseus software platform

Data Visualization Handbook

Exploratory Data Analysis and Plotting

STC: Descriptive Statistics in Excel Running Descriptive and Correlational Analysis in Excel 2013

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Psychology 205: Research Methods in Psychology

Common Tools for Displaying and Communicating Data for Process Improvement

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA

Visualizing Data: Scalable Interactivity

Distances, Clustering, and Classification. Heatmaps

Interactive Voting System. IVS-Basic IVS-Professional 4.4

Advanced Statistical Methods in Insurance

Data exploration with Microsoft Excel: analysing more than one variable

Exercise 1.12 (Pg )

Tutorial: ggplot2. Ramon Saccilotto Universitätsspital Basel Hebelstrasse 10 T F saccilottor@uhbs.ch

03 The full syllabus. 03 The full syllabus continued. For more information visit PAPER C03 FUNDAMENTALS OF BUSINESS MATHEMATICS

Visualization methods for patent data

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Portal Connector Fields and Widgets Technical Documentation

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

MultiExperiment Viewer Quickstart Guide

Data Visualization. Scientific Principles, Design Choices and Implementation in LabKey. Cory Nathe Software Engineer, LabKey

Data representation and analysis in Excel

A Demonstration of Hierarchical Clustering

Multivariate Analysis of Ecological Data

2. Simple Linear Regression

MicroStrategy Analytics Express User Guide

Analytics on Big Data

sample median Sample quartiles sample deciles sample quantiles sample percentiles Exercise 1 five number summary # Create and view a sorted

Spreadsheet. Parts of a Spreadsheet. Entry Bar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Package tagcloud. R topics documented: July 3, 2015

Instructions for SPSS 21

Data Exploration and Preprocessing. Data Mining and Text Mining (UIC Politecnico di Milano)

The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA

Exploratory data analysis (Chapter 2) Fall 2011

Using Excel for Handling, Graphing, and Analyzing Scientific Data:

Using Excel for descriptive statistics

Periodontology. Digital Art Guidelines JOURNAL OF. Monochrome Combination Halftones (grayscale or color images with text and/or line art)

A Layered Grammar of Graphics

Statistics Revision Sheet Question 6 of Paper 2

Bar Graphs and Dot Plots

Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies

GeoGebra Statistics and Probability

Final Project Report

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Transcription:

Graphics and Data Visualization in R Overview Thomas Girke December 13, 2013 Graphics and Data Visualization in R Slide 1/121

Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Slide 2/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Overview Slide 3/121

Graphics in R Powerful environment for visualizing scientific data Integrated graphics and statistics infrastructure Publication quality graphics Fully programmable Highly reproducible Full L A TEX Link & Sweave Link support Vast number of R packages with graphics utilities Graphics and Data Visualization in R Overview Slide 4/121

Documentation on Graphics in R General Graphics Task Page Link R Graph Gallery Link R Graphical Manual Link Paul Murrell s book R (Grid) Graphics Link Interactive graphics rggobi (GGobi) Link iplots Link Open GL (rgl) Link Graphics and Data Visualization in R Overview Slide 5/121

Graphics Environments Viewing and saving graphics in R On-screen graphics postscript, pdf, svg jpeg/png/wmf/tiff/... Four major graphic environments Low-level infrastructure R Base Graphics (low- and high-level) grid: Manual Link, Book Link High-level infrastructure lattice: Manual Link, Intro Link, Book Link ggplot2: Manual Link, Intro Link, Book Link Graphics and Data Visualization in R Overview Slide 6/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments Slide 7/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 8/121

Base Graphics: Overview Important high-level plotting functions plot: generic x-y plotting barplot: bar plots boxplot: box-and-whisker plot hist: histograms pie: pie charts dotchart: cleveland dot plots image, heatmap, contour, persp: functions to generate image-like plots qqnorm, qqline, qqplot: distribution comparison plots pairs, coplot: display of multivariant data Help on these functions?myfct?plot?par Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 9/121

Base Graphics: Preferred Input Data Objects Matrices and data frames Vectors Named vectors Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 10/121

Scatter Plot: very basic Sample data set for subsequent plots > set.seed(1410) > y <- matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3])) > plot(y[,1], y[,2]) y[, 2] 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 y[, 1] Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 11/121

Scatter Plot: all pairs > pairs(y) A 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 B 0.2 0.4 0.6 0.8 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 C Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 12/121

Scatter Plot: with labels > plot(y[,1], y[,2], pch=20, col="red", main="symbols and Labels") > text(y[,1]+0.03, y[,2], rownames(y)) Symbols and Labels y[, 2] 0.2 0.4 0.6 0.8 j a f h b e g d i c 0.2 0.4 0.6 0.8 y[, 1] Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 13/121

Scatter Plots: more examples Print instead of symbols the row names > plot(y[,1], y[,2], type="n", main="plot of Labels") > text(y[,1], y[,2], rownames(y)) Usage of important plotting parameters > grid(5, 5, lwd = 2) > op <- par(mar=c(8,8,8,8), bg="lightblue") > plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2, + cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label", + ylab="y label", main="my Main", sub="my Sub") > par(op) Important arguments mar: specifies the margin sizes around the plotting area in order: c(bottom, left, top, right) col: color of symbols pch: type of symbols, samples: example(points) lwd: size of symbols cex.*: control font sizes For details see?par Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 14/121

Scatter Plots: more examples Add a regression line to a plot > plot(y[,1], y[,2]) > myline <- lm(y[,2]~y[,1]); abline(myline, lwd=2) > summary(myline) Same plot as above, but on log scale > plot(y[,1], y[,2], log="xy") Add a mathematical expression to a plot > plot(y[,1], y[,2]); text(y[1,1], y[1,2], > expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 15/121

Exercise 1: Scatter Plots Task 1 Generate scatter plot for first two columns in iris data frame and color dots by its Species column. Task 2 Use the xlim/ylim arguments to set limits on the x- and y-axes so that all data points are restricted to the left bottom quadrant of the plot. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa > table(iris$species) setosa versicolor virginica 50 50 50 Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 16/121

Line Plot: Single Data Set > plot(y[,1], type="l", lwd=2, col="blue") y[, 1] 0.2 0.4 0.6 0.8 2 4 6 8 10 Index Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 17/121

Line Plots: Many Data Sets > split.screen(c(1,1)); [1] 1 > plot(y[,1], ylim=c(0,1), xlab="measurement", ylab="intensity", type="l", lwd=2, col=1) > for(i in 2:length(y[1,])) { + screen(1, new=false) + plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", + xlab="", main="", bty="n") + } > close.screen(all=true) Intensity 0.0 0.2 0.4 0.6 0.8 1.0 2 4 6 8 10 Measurement Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 18/121

Bar Plot Basics > barplot(y[1:4,], ylim=c(0, max(y[1:4,])+0.3), beside=true, + legend=letters[1:4]) > text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1) + +sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.04) 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.27 0.53 0.93 0.14 0.47 0.31 0.12 0.05 0.44 0.32 A B C 0 a b c d 0.41 Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 19/121

Bar Plots with Error Bars > bar <- barplot(m <- rowmeans(y) * 10, ylim=c(0, 10)) > stdev <- sd(t(y)) > arrows(bar, m, bar, m + stdev, length=0.15, angle = 90) 0 2 4 6 8 10 a b c d e f g h i j Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 20/121

Mirrored Bar Plots > df <- data.frame(group = rep(c("above", "Below"), each=10), x = rep(1:10, 2), y > plot(c(0,12),range(df$y),type = "n") > barplot(height = df$y[df$group == 'Above'], add = TRUE,axes = FALSE) > barplot(height = df$y[df$group == 'Below'], add = TRUE,axes = FALSE) range(df$y) 1.0 0.5 0.0 0.5 0 2 4 6 8 10 12 c(0, 12) Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 21/121

Histograms > hist(y, freq=true, breaks=10) Histogram of y Frequency 0 1 2 3 4 0.0 0.2 0.4 0.6 0.8 1.0 y Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 22/121

Density Plots > plot(density(y), col="red") density.default(x = y) Density 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.5 1.0 N = 30 Bandwidth = 0.136 Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 23/121

Pie Charts > pie(y[,1], col=rainbow(length(y[,1]), start=0.1, end=0.8), clockwise=true) > legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8, + col=rainbow(length(y[,1]), start=0.1, end=0.8), ncol=1) a b c d e f g h i j a b c d e f g h i j Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 24/121

Color Selection Utilities Default color palette and how to change it > palette() [1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow" "gray" > palette(rainbow(5, start=0.1, end=0.2)) > palette() [1] "#FF9900" "#FFBF00" "#FFE600" "#F2FF00" "#CCFF00" > palette("default") The gray function allows to select any type of gray shades by providing values from 0 to 1 > gray(seq(0.1, 1, by= 0.2)) [1] "#1A1A1A" "#4D4D4D" "#808080" "#B3B3B3" "#E6E6E6" Color gradients with colorpanel function from gplots library > library(gplots) > colorpanel(5, "darkblue", "yellow", "white") Much more on colors in R see Earl Glynn s color chart Link Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 25/121

Arranging Several Plots on Single Page With par(mfrow=c(nrow,ncol)) one can define how several plots are arranged next to each other. > par(mfrow=c(2,3)); for(i in 1:6) { plot(1:10) } 1:10 2 4 6 8 10 1:10 2 4 6 8 10 1:10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 2 4 6 8 10 Index Index Index 1:10 2 4 6 8 10 1:10 2 4 6 8 10 1:10 2 4 6 8 10 2 4 6 8 10 Index 2 4 6 8 10 Index 2 4 6 8 10 Index Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 26/121

Arranging Plots with Variable Width The layout function allows to divide the plotting device into variable numbers of rows and columns with the column-widths and the row-heights specified in the respective arguments. > nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=true), c(3,7), c(5,5), + respect=true) > # layout.show(nf) > for(i in 1:3) { barplot(1:10) } 0 2 4 6 8 10 0 2 4 6 8 10 0 2 4 6 8 10 Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 27/121

Saving Graphics to Files After the pdf() command all graphs are redirected to file test.pdf. Works for all common formats similarly: jpeg, png, ps, tiff,... > pdf("test.pdf"); plot(1:10, 1:10); dev.off() Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as InkScape. > svg("test.svg"); plot(1:10, 1:10); dev.off() Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 28/121

Exercise 2: Bar Plots Task 1 Calculate the mean values for the Species components of the first four columns in the iris data set. Organize the results in a matrix where the row names are the unique values from the iris Species column and the column names are the same as in the first four iris columns. Task 2 Generate two bar plots: one with stacked bars and one with horizontally arranged bars. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa > table(iris$species) setosa versicolor virginica 50 50 50 Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 29/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments Grid Graphics Slide 30/121

grid Graphics Environment What is grid? Low-level graphics system Highly flexible and controllable system Does not provide high-level functions Intended as development environment for custom plotting functions Pre-installed on new R distributions Documentation and Help Manual Link Book Link Graphics and Data Visualization in R Graphics Environments Grid Graphics Slide 31/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments lattice Slide 32/121

lattice Environment What is lattice? High-level graphics system Developed by Deepayan Sarkar Implements Trellis graphics system from S-Plus Simplifies high-level plotting tasks: arranging complex graphical features Syntax similar to R s base graphics Documentation and Help Manual Link Intro Link Book Link library(help=lattice) opens a list of all functions available in the lattice package Accessing and changing global parameters:?lattice.options and?trellis.device Graphics and Data Visualization in R Graphics Environments lattice Slide 33/121

Scatter Plot Sample > library(lattice) > p1 <- xyplot(1:8 ~ 1:8 rep(letters[1:4], each=2), as.table=true) > plot(p1) 2 4 6 8 8 A B 6 4 2 1:8 C D 8 6 4 2 2 4 6 8 1:8 Graphics and Data Visualization in R Graphics Environments lattice Slide 34/121

Line Plot Sample > library(lattice) > p2 <- parallelplot(~iris[1:4] Species, iris, horizontal.axis = FALSE, + layout = c(1, 3, 1)) > plot(p2) Max virginica Min versicolor Max Min Max setosa Min Sepal.Length Sepal.Width Petal.Length Petal.Width Graphics and Data Visualization in R Graphics Environments lattice Slide 35/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 36/121

ggplot2 Environment What is ggplot2? High-level graphics system Implements grammar of graphics from Leland Wilkinson Link Streamlines many graphics workflows for complex plots Syntax centered around main ggplot function Simpler qplot function provides many shortcuts Documentation and Help Manual Link Intro Link Book Link Cookbook for R Link Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 37/121

ggplot2 Usage ggplot function accepts two arguments Data set to be plotted Aesthetic mappings provided by aes function Additional parameters such as geometric objects (e.g. points, lines, bars) are passed on by appending them with + as separator. List of available geom_* functions: Settings of plotting theme can be accessed with the command theme_get() and its settings can be changed with theme(). Preferred input data object qgplot: data.frame (support for vector, matrix,...) ggplot: data.frame Packages with convenience utilities to create expected inputs plyr reshape Link Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 38/121

qplot Function qplot syntax is similar to R s basic plot function Arguments: x: x-coordinates (e.g. col1) y: y-coordinates (e.g. col2) data: data frame with corresponding column names xlim, ylim: e.g. xlim=c(0,10) log: e.g. log="x" or log="xy" main: main title; see?plotmath for mathematical formula xlab, ylab: labels for the x- and y-axes color, shape, size...: many arguments accepted by plot function Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 39/121

qplot: Scatter Plots Create sample data > library(ggplot2) > x <- sample(1:10, 10); y <- sample(1:10, 10); cat <- rep(c("a", "B"), 5) Simple scatter plot > qplot(x, y, geom="point") Prints dots with different sizes and colors > qplot(x, y, geom="point", size=x, color=cat, + main="dot Size and Color Relative to Some Values") Drops legend > qplot(x, y, geom="point", size=x, color=cat) + + theme(legend.position = "none") Plot different shapes > qplot(x, y, geom="point", size=5, shape=cat) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 40/121

qplot: Scatter Plot with qplot > p <- qplot(x, y, geom="point", size=x, color=cat, + main="dot Size and Color Relative to Some Values") + + theme(legend.position = "none") > print(p) Dot Size and Color Relative to Some Values 10.0 7.5 y 5.0 2.5 2.5 5.0 7.5 10.0 x Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 41/121

qplot: Scatter Plot with Regression Line > set.seed(1410) > dsmall <- diamonds[sample(nrow(diamonds), 1000), ] > p <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"), + method = "lm") > print(p) 0 10000 20000 1 2 3 carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 42/121

qplot: Scatter Plot with Local Regression Curve (loess) > p <- qplot(carat, price, data=dsmall, geom=c("point", "smooth"), span=0.4) > print(p) # Setting 'se=false' removes error shade 0 5000 10000 15000 1 2 3 carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 43/121

ggplot Function More important than qplot to access full functionality of ggplot2 Main arguments data set, usually a data.frame aesthetic mappings provided by aes function General ggplot syntax ggplot(data, aes(...)) + geom_*() +... + stat_*() +... Layer specifications geom_*(mapping, data,..., geom, position) stat_*(mapping, data,..., stat, position) Additional components scales coordinates facet aes() mappings can be passed on to all components (ggplot, geom_*, etc.). Effects are global when passed on to ggplot() and local for other components. x, y color: grouping vector (factor) group: grouping vector (factor) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 44/121

Changing Plotting Themes with ggplot Theme settings can be accessed with theme_get() Their settings can be changed with theme() Some examples Change background color to white... + theme(panel.background=element_rect(fill = "white", colour = "black")) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 45/121

Storing ggplot Specifications Plots and layers can be stored in variables > p <- ggplot(dsmall, aes(carat, price)) + geom_point() > p # or print(p) Returns information about data and aesthetic mappings followed by each layer > summary(p) Prints dots with different sizes and colors > bestfit <- geom_smooth(methodw = "lm", se = F, color = alpha("steelblue", 0.5), > p + bestfit # Plot with custom regression line Syntax to pass on other data sets > p %+% diamonds[sample(nrow(diamonds), 100),] Saves plot stored in variable p to file > ggsave(p, file="myplot.pdf") Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 46/121

ggplot: Scatter Plot > p <- ggplot(dsmall, aes(carat, price, color=color)) + + geom_point(size=4) > print(p) 0 5000 10000 15000 1 2 3 carat price color D E F G H I J Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 47/121

ggplot: Scatter Plot with Regression Line > p <- ggplot(dsmall, aes(carat, price)) + geom_point() + + geom_smooth(method="lm", se=false) + + theme(panel.background=element_rect(fill = "white", colour = "black > print(p) 0 5000 10000 15000 20000 25000 1 2 3 carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 48/121

ggplot: Scatter Plot with Several Regression Lines > p <- ggplot(dsmall, aes(carat, price, group=color)) + + geom_point(aes(color=color), size=2) + + geom_smooth(aes(color=color), method = "lm", se=false) > print(p) 0 5000 10000 15000 20000 1 2 3 carat price color D E F G H I J Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 49/121

ggplot: Scatter Plot with Local Regression Curve (loess) > p <- ggplot(dsmall, aes(carat, price)) + geom_point() + geom_smooth() > print(p) # Setting 'se=false' removes error shade 0 5000 10000 15000 1 2 3 carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 50/121

ggplot: Line Plot > p <- ggplot(iris, aes(petal.length, Petal.Width, group=species, + color=species)) + geom_line() > print(p) 2.5 2.0 Petal.Width 1.5 Species setosa versicolor virginica 1.0 0.5 0.0 2 4 6 Petal.Length Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 51/121

ggplot: Faceting > p <- ggplot(iris, aes(sepal.length, Sepal.Width)) + + geom_line(aes(color=species), size=1) + + facet_wrap(~species, ncol=1) > print(p) 4.5 setosa 4.0 3.5 3.0 2.5 2.0 4.5 versicolor Sepal.Width 4.0 3.5 3.0 2.5 Species setosa versicolor virginica 2.0 4.5 virginica 4.0 3.5 3.0 2.5 2.0 5 6 7 8 Sepal.Length Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 52/121

Exercise 3: Scatter Plots Task 1 Generate scatter plot for first two columns in iris data frame and color dots by its Species column. Task 2 Use the xlim, ylim functionss to set limits on the x- and y-axes so that all data points are restricted to the left bottom quadrant of the plot. Task 3 Generate corresponding line plot with faceting show individual data sets in saparate plots. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa > table(iris$species) setosa versicolor virginica 50 50 50 Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 53/121

ggplot: Bar Plots Sample Set: the following transforms the iris data set into a ggplot2-friendly format. Calculate mean values for aggregates given by Species column in iris data set > iris_mean <- aggregate(iris[,1:4], by=list(species=iris$species), FUN=mean) Calculate standard deviations for aggregates given by Species column in iris data set > iris_sd <- aggregate(iris[,1:4], by=list(species=iris$species), FUN=sd) Convert iris_mean with melt > library(reshape2) # Defines melt function > df_mean <- melt(iris_mean, id.vars=c("species"), variable.name = "Samples", value.n Convert iris_sd with melt > df_sd <- melt(iris_sd, id.vars=c("species"), variable.name = "Samples", value.name= Define standard deviation limits > limits <- aes(ymax = df_mean[,"values"] + df_sd[,"values"], ymin=df_mean[,"values"] Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 54/121

ggplot: Bar Plot > p <- ggplot(df_mean, aes(samples, Values, fill = Species)) + + geom_bar(position="dodge", stat="identity") > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 55/121

ggplot: Bar Plot Sideways > p <- ggplot(df_mean, aes(samples, Values, fill = Species)) + + geom_bar(position="dodge", stat="identity") + coord_flip() + + theme(axis.text.y=theme_text(angle=0, hjust=1)) > print(p) Petal.Width Petal.Length Samples Species setosa versicolor virginica Sepal.Width Sepal.Length 0 2 4 6 Values Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 56/121

ggplot: Bar Plot with Faceting > p <- ggplot(df_mean, aes(samples, Values)) + geom_bar(aes(fill = Species), stat + facet_wrap(~species, ncol=1) > print(p) 6 setosa 4 2 0 versicolor Values 6 4 2 Species setosa versicolor virginica 0 virginica 6 4 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 57/121

ggplot: Bar Plot with Error Bars > p <- ggplot(df_mean, aes(samples, Values, fill = Species)) + + geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 58/121

ggplot: Changing Color Settings > library(rcolorbrewer) > # display.brewer.all() > p <- ggplot(df_mean, aes(samples, Values, fill=species, color=species)) + + geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, position="dodge") + + scale_fill_brewer(palette="blues") + scale_color_brewer(palette = "Greys") > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 59/121

ggplot: Using Standard Colors > p <- ggplot(df_mean, aes(samples, Values, fill=species, color=species)) + + geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, position="dodge") + + scale_fill_manual(values=c("red", "green3", "blue")) + + scale_color_manual(values=c("red", "green3", "blue")) > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 60/121

ggplot: Mirrored Bar Plots > df <- data.frame(group = rep(c("above", "Below"), each=10), x = rep(1:10, 2), y = c(runif(10, 0, 1), runi > p <- ggplot(df, aes(x=x, y=y, fill=group)) + + geom_bar(stat="identity", position="identity") > print(p) 0.5 y 0.0 group Above Below 0.5 1.0 2.5 5.0 7.5 10.0 x Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 61/121

Exercise 4: Bar Plots Task 1 Calculate the mean values for the Species components of the first four columns in the iris data set. Use the melt function from the reshape2 package to bring the results into the expected format for ggplot. Task 2 Generate two bar plots: one with stacked bars and one with horizontally arranged bars. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa > table(iris$species) setosa versicolor virginica 50 50 50 Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 62/121

ggplot: Data Reformatting Example for Line Plot > y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("sample", 1:5, sep=""))) > y <- data.frame(position=1:length(y[,1]), y) > y[1:4, ] # First rows of input format expected by melt() Position Sample1 Sample2 Sample3 Sample4 Sample5 g1 1 1.0002088 0.6850199-0.21324932 1.27195056 1.0479301 g2 2-1.2024596-1.5004962-0.01111579 0.07584497-0.7100662 g3 3 0.1023678-0.5153367 0.28564390 1.41522878 1.1084695 g4 4 1.3294248-1.2084007-0.19581898-0.42361768 1.7139697 > df <- melt(y, id.vars=c("position"), variable.name = "Samples", value.name="values") > p <- ggplot(df, aes(position, Values)) + geom_line(aes(color=samples)) + facet_wrap(~samples, ncol=1) > print(p) > ## Represent same data in box plot > ## ggplot(df, aes(samples, Values, fill=samples)) + geom_boxplot() Sample1 Values 2 0 2 4 2 0 2 4 2 0 2 4 2 0 2 4 Sample2 Sample3 Sample4 Sample5 Samples Sample1 Sample2 Sample3 Sample4 Sample5 2 0 2 Graphics and Data Visualization in R 4 Graphics Environments ggplot2 Slide 63/121

ggplot: Jitter Plots > p <- ggplot(dsmall, aes(color, price/carat)) + + geom_jitter(alpha = I(1 / 2), aes(color=color)) > print(p) 10000 price/carat 5000 color D E F G H I J D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 64/121

ggplot: Box Plots > p <- ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot() > print(p) price/carat 10000 color D E F G H I 5000 J D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 65/121

ggplot: Density Plot with Line Coloring > p <- ggplot(dsmall, aes(carat)) + geom_density(aes(color = color)) > print(p) 1.5 density 1.0 0.5 color D E F G H I J 0.0 1 2 3 carat Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 66/121

ggplot: Density Plot with Area Coloring > p <- ggplot(dsmall, aes(carat)) + geom_density(aes(fill = color)) > print(p) 1.5 density 1.0 0.5 color D E F G H I J 0.0 1 2 3 carat Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 67/121

ggplot: Histograms > p <- ggplot(iris, aes(x=sepal.width)) + geom_histogram(aes(y =..density.., + fill =..count..), binwidth=0.2) + geom_density() > print(p) 1.2 0.8 count density 30 20 10 0.4 0 0.0 2 3 4 Sepal.Width Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 68/121

ggplot: Pie Chart > df <- data.frame(variable=rep(c("cat", "mouse", "dog", "bird", "fly")), + value=c(1,3,3,4,2)) > p <- ggplot(df, aes(x = "", y = value, fill = variable)) + + geom_bar(width = 1, stat="identity") + + coord_polar("y", start=pi / 3) + ggtitle("pie Chart") > print(p) Pie Chart 10.0 12.5 0.0 "" 7.5 variable bird cat dog fly mouse 2.5 5.0 value Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 69/121

ggplot: Wind Rose Pie Chart > p <- ggplot(df, aes(x = variable, y = value, fill = variable)) + + geom_bar(width = 1, stat="identity") + coord_polar("y", start=pi / 3) + + ggtitle("pie Chart") > print(p) Pie Chart mouse 3 variable fly dog cat bird 0/4 variable bird cat dog fly mouse 2 1 value Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 70/121

ggplot: Arranging Graphics on One Page > library(grid) > a <- ggplot(dsmall, aes(color, price/carat)) + geom_jitter(size=4, alpha = I(1 / 1.5), aes(color=color)) > b <- ggplot(dsmall, aes(color, price/carat, color=color)) + geom_boxplot() > c <- ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot() + theme(legend.position = "none > grid.newpage() # Open a new page on grid device > pushviewport(viewport(layout = grid.layout(2, 2))) # Assign to device viewport with 2 by 2 grid layout > print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1:2)) > print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1)) > print(c, vp = viewport(layout.pos.row = 2, layout.pos.col = 2, width=0.3, height=0.3, x=0.8, y=0.8)) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 71/121

ggplot: Arranging Graphics on One Page 10000 price/carat 8000 6000 4000 2000 color D E F G H I J D E F G H I J color 10000 10000 color price/carat 8000 6000 4000 D E F G H I price/carat 8000 6000 4000 2000 J 2000 D E F G H I J color D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 72/121

ggplot: Inserting Graphics into Plots > # pdf("insert.pdf") > print(a) > print(b, vp=viewport(width=0.3, height=0.3, x=0.8, y=0.8)) > # dev.off() 10000 8000 price/carat color 10000 D 8000 E 6000 F G 4000 H 2000 I DEFGHIJ J color price/carat 6000 color D E F G H I J 4000 2000 D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 73/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Specialty Graphics Slide 74/121

Venn Diagrams (Code) > source("http://faculty.ucr.edu/~tgirke/documents/r_biocond/my_r_scripts/overlapper.r") > setlist5 <- list(a=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 2 > OLlist5 <- overlapper(setlist=setlist5, sep="_", type="vennsets") > counts <- sapply(ollist5$venn_list, length) > # pdf("venn.pdf") > vennplot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) > # dev.off() Graphics and Data Visualization in R Specialty Graphics Slide 75/121

Venn Diagram (Plot) Venn Diagram Unique objects: All = 26; S1 = 18; S2 = 16; S3 = 20; S4 = 22; S5 = 18 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 2 2 0 1 2 0 1 2 2 1 0 3 2 5 A B C D E Figure: Venn Diagram Graphics and Data Visualization in R Specialty Graphics Slide 76/121

Compound Depictions with ChemmineR > library(chemminer) > data(sdfsample) > plot(sdfsample[1], print=false) CMP1 N O H N O O N O O O N H Graphics and Data Visualization in R Specialty Graphics Slide 77/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Genome Graphics Slide 78/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Genome Graphics ggbio Slide 79/121

ggbio: A Programmable Genome Browser A genome browser is a visulalization tool for plotting different types of genomic data in separate tracks along chromosomes. The ggbio package (Yin et al., 2012) facilitates plotting of complex genome data objects, such as read alignments (SAM/BAM), genomic context/annotation information (gff/txdb), variant calls (VCF/BCF), and more. To easily compare these data sets, it extends the faceting facility of ggplot2 to genome browser-like tracks. Most of the core object types for handling genomic data with R/Bioconductor are supported: GRanges, GAlignments, VCF, etc. For more details, see Table 1.1 of the ggbio vignette Link. ggbio s convenience plotting function is autoplot. For more customizable plots, one can use the generic ggplot function. Apart from the standard ggplot2 plotting components, ggbio defines serval new components useful for genomic data visualization. A detailed list is given in Table 1.2 of the vignette Link. Useful web sites: ggbio manual Link ggbio functions Link autoplot demo Link Graphics and Data Visualization in R Genome Graphics ggbio Slide 80/121

Graphics and Data Visualization in R Genome Graphics ggbio Slide 81/121 Tracks: Aligning Plots Along Chromosomes > library(ggbio) > df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10) > p1 <- qplot(data = df1, x = time, y = score, geom = "line") > df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm(120-30 +1)) > p2 <- ggplot(data = df2, aes(x = time, y = score)) + geom_line() + geom_point(size = 2, aes(color = value > tracks(time1 = p1, time2 = p2) + xlim(1, 40) + theme_tracks_sunset() 10 5 time1 score 0 5 10 10 5 value 2 time2 score 0 1 0 1 5 2 3 10

Plotting Genomic Ranges GRanges objects are essential for storing alignment or annotation ranges in R/Bioconductor. The following creates a sample GRanges object and plots its content. > library(genomicranges) > set.seed(1); N <- 100; gr <- GRanges(seqnames = sample(c("chr1", "chr2", "chr3"), size = N, replace = TRU > autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames) chr1 chr2 chr3 + strand + 0 bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp Graphics and Data Visualization in R Genome Graphics ggbio Slide 82/121

Plotting Coverage Instead of Ranges > autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames, stat = "coverage") chr1 chr2 chr3 7.5 Coverage 5.0 2.5 0.0 7.5 5.0 + strand + 2.5 0.0 0 bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp Graphics and Data Visualization in R Genome Graphics ggbio Slide 83/121

Mirrored Coverage Plot > pos <- sapply(coverage(gr[strand(gr)=="+"]), as.numeric) > pos <- data.frame(chr=rep(names(pos), sapply(pos, length)), Strand=rep("+", length(unlist(pos))), Positio > neg <- sapply(coverage(gr[strand(gr)=="-"]), as.numeric) > neg <- data.frame(chr=rep(names(neg), sapply(neg, length)), Strand=rep("-", length(unlist(neg))), Positio > covdf <- rbind(pos, neg) > p <- ggplot(covdf, aes(position, Coverage, fill=strand)) + + geom_bar(stat="identity", position="identity") + facet_wrap(~chr) > p chr1 chr2 chr3 5 Coverage 0 Strand + 5 0 100 200 300 0 100 200 300 0 100 200 300 Position Graphics and Data Visualization in R Genome Graphics ggbio Slide 84/121

Circular Layout > autoplot(gr, layout = "circle", aes(fill = seqnames)) seqnames chr1 chr2 chr3 Graphics and Data Visualization in R Genome Graphics ggbio Slide 85/121

More Complex Circular Example > seqlengths(gr) <- c(400, 500, 700) > values(gr)$to.gr <- gr[sample(1:length(gr), size = length(gr))] > idx <- sample(1:length(gr), size = 50) > gr <- gr[idx] > ggplot() + layout_circle(gr, geom = "ideo", fill = "gray70", radius = 7, trackwidth = 3) + + layout_circle(gr, geom = "bar", radius = 10, trackwidth = 4, + aes(fill = score, y = score)) + + layout_circle(gr, geom = "point", color = "red", radius = 14, + trackwidth = 3, grid = TRUE, aes(y = score)) + + layout_circle(gr, geom = "link", linked.to = "to.gr", radius = 6, trackwidth = 1) score 100 50 Graphics and Data Visualization in R Genome Graphics ggbio Slide 86/121

Viewing Alignments and Variants To make the following example work, please download and unpack this data archive Link containing GFF, BAM and VCF sample files. > library(rtracklayer); library(genomicfeatures); library(rsamtools); library(variantannotation) > ga <- readgalignmentsfrombam("./data/srr064167.fastq.bam", use.names=true, param=scanbamparam(which=grang > p1 <- autoplot(ga, geom = "rect") > p2 <- autoplot(ga, geom = "line", stat = "coverage") > vcf <- readvcf(file="data/varianttools_gnsap.vcf", genome="ath1") > p3 <- autoplot(vcf[seqnames(vcf)=="chr5"], type = "fixed") + xlim(4000, 8000) + theme(legend.position = " > txdb <- maketranscriptdbfromgff(file="./data/tair10_gff3_trunc.gff", format="gff3") > p4 <- autoplot(txdb, which=granges("chr5", IRanges(4000, 8000)), names.expr = "gene_id") > tracks(reads=p1, Coverage=p2, Variant=p3, Transcripts=p4, heights = c(0.3, 0.2, 0.1, 0.35)) + ylab("") Reads Coverage Variant 75 50 25 0 C T ALTREF Transcripts AT5G01015 AT5G01015 AT5G01010 AT5G01010 AT5G01010 AT5G01010 AT5G01020 3 kb 4 kb 5 kb 6 kb 7 kb 8 kb Genome Graphics Graphics and Data Visualization in R ggbio Slide 87/121

Additional Sample Plots autoplot demo Link Graphics and Data Visualization in R Genome Graphics ggbio Slide 88/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Genome Graphics Additional Genome Graphics Slide 89/121

Additional Packages for Visualizing Genome Data Gviz Link RCircos (Zhang et al., 2013) Genome Graphs genoplotr Link Link Link Graphics and Data Visualization in R Genome Graphics Additional Genome Graphics Slide 90/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Slide 91/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Background Slide 92/121

What is Clustering? Clustering is the classification/partitioning of data objects into similarity groups (clusters) according to a defined distance measure. It is used in many fields, such as machine learning, data mining, pattern recognition, image analysis, genomics, systems biology, etc. Machine learning typically regards data clustering as a form of unsupervised learning. Graphics and Data Visualization in R Clustering Background Slide 93/121

Why Clustering and Data Mining in R? Efficient data structures and functions for clustering. Efficient environment for algorithm prototyping and benchmarking. Comprehensive set of clustering and machine learning libraries. Standard for data analysis in many areas. Overview: Clustering Task View on CRAN Graphics and Data Visualization in R Clustering Background Slide 94/121

Data Transformations Choice depends on data set! Center & standardize 1 Center: subtract vector mean from each value 2 Standardize: devide by standard deviation Mean = 0 and STDEV = 1 Center & scale with the scale() fuction 1 Center: subtract vector mean from each value 2 Scale: divide centered vector by their root mean square (rms) Log transformation x rms = 1 n 1 n i=1 Mean = 0 and STDEV = 1 Rank transformation: replace measured values by ranks No transformation x i 2 Graphics and Data Visualization in R Clustering Background Slide 95/121

Distance Methods List of most common ones! Euclidean distance for two profiles X and Y d(x, Y ) = n (x i y i ) 2 Disadvantages: not scale invariant, not for negative correlations i=1 Maximum, Manhattan, Canberra, binary, Minowski,... Correlation-based distance: 1 r Pearson correlation coefficient (PCC) r = n n i=1 x iy i n i=1 x n i i=1 y i ( n i=1 x i 2 ( n i=1 x i) 2 )( n i=1 y i 2 ( n i=1 y i) 2 ) Disadvantage: outlier sensitive Spearman correlation coefficient (SCC) Same calculation as PCC but with ranked values! Graphics and Data Visualization in R Clustering Background Slide 96/121

There Are Many more Distance Measures If the distances among items are quantifiable, then clustering is possible. Choose the most accurate and meaningful distance measure for a given field of application. If uncertain then choose several distance measures and compare the results. Graphics and Data Visualization in R Clustering Background Slide 97/121

Cluster Linkage Single Linkage Complete Linkage Average Linkage Graphics and Data Visualization in R Clustering Background Slide 98/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 99/121

Hierarchical Clustering Steps 1 Identify clusters (items) with closest distance 2 Join them to new clusters 3 Compute distance between clusters (items) 4 Return to step 1 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 100/121

Hierarchical Clustering Agglomerative Approach (a) 0.1 g1 g2 g3 g4 g5 (b) 0.4 0.1 g1 g2 g3 g4 g5 (c) 0.6 0.4 0.5 0.1 g1 g2 g3 g4 g5 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 101/121

Hierarchical Clustering Result with Heatmap Count 0 1 2 3 Color Key and Histogram 1 0 1 Row Z Score g10 g7 g9 g1 g5 g2 g6 g8 g4 g3 t4 t5 t2 t1 t3 A heatmap is a color coded table. To visually identify patterns, the rows and columns of a heatmap are often sorted by hierarchical clustering trees. In case of gene expression data, the row tree usually represents the genes, the column tree the treatments and the colors in the heat table represent the intensities or ratios of the underlying gene expression data set. Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 102/121

Hierarchical Clustering Approaches in R 1 Agglomerative approach (bottom-up) hclust() and agnes() 2 Divisive approach (top-down) diana() Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 103/121

Tree Cutting to Obtain Discrete Clusters 1 Node height in tree 2 Number of clusters 3 Search tree nodes by distance cutoff Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 104/121

Example: hclust and heatmap.2 > library(gplots) > y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("t", 1:5, sep=""))) > heatmap.2(y) # Shortcut to final result Color Key and Histogram 2 0 2 Value g39 g81 g31 g22 g66 g41 g5 g77 g58 g19 g96 g32 g71 g20 g60 g56 g89 g92 g73 g47 g61 g38 g12 g86 g33 g25 g42 g17 g28 g90 g51 g91 g34 g69 g16 g94 g46 g75 g36 g37 g68 g57 g53 g62 g59 g85 g100 g88 g26 g70 g24 g93 g67 g15 g30 g44 g54 g23 g83 g65 g21 g8 g52 g1 g78 g29 g63 g49 g95 g55 g97 g18 g80 g50 g7 g2 g76 g10 g11 g3 g40 g6 g99 g98 g43 g79 g4 g64 g84 g74 g35 g27 g82 g48 g14 g87 g45 g13 g72 t1 t2 t3 t5 t4 Count 0 40 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 105/121

Example: Stepwise Approach with Tree Cutting > ## Row- and column-wise clustering > hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete") > hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") > ## Tree cutting > mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9); mycolhc > ## Plot heatmap > mycol <- colorpanel(40, "darkblue", "yellow", "white") # or try redgreen(75) > heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=mycol, scale="row", density.info="none", trace Color Key 1.5 0 1 Row Z Score g79 g64 g84 g20 g60 g4 g45 g82 g27 g35 g74 g14 g98 g6 g99 g8 g21 g93 g65 g24 g48 g100 g63 g87 g49 g85 g29 g59 g80 g88 g26 g70 g30 g11 g67 g52 g9 g95 g40 g1 g78 g3 g55 g61 g47 g73 g15 g81 g32 g71 g7 g41 g2 g22 g10 g76 g97 g18 g50 g53 g19 g96 g5 g66 g77 g25 g17 g42 g31 g58 g12 g38 g62 g28 g33 g68 g86 g46 g72 g92 g56 g89 g44 g83 g13 g54 g23 g39 g34 g94 g75 g43 g16 g69 g36 g37 g91 g57 g51 g90 t1 t2 t5 t3 t4 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 106/121

Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 107/121

K-Means Clustering 1 Choose the number of k clusters 2 Randomly assign items to the k clusters 3 Calculate new centroid for each of the k clusters 4 Calculate the distance of all items to the k centroids 5 Assign items to closest centroid 6 Repeat until clusters assignments are stable Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 108/121

K-Means (a) X X X (b) X X X (c) X X X Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 109/121

Example: Clustering with kmeans Function > km <- kmeans(t(scale(t(y))), 3) > km$cluster g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g17 2 1 2 3 1 3 1 2 2 1 3 1 3 2 3 3 1 g45 g46 g47 g48 g49 g50 g51 g52 g53 g54 g55 g56 g57 g58 g59 g60 g61 3 3 3 2 2 2 1 2 1 3 2 3 1 1 2 3 3 g89 g90 g91 g92 g93 g94 g95 g96 g97 g98 g99 g100 3 1 1 3 2 1 2 1 1 3 3 2 Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 110/121

Fuzzy C-Means Clustering 1 In contrast to strict (hard) clustering approaches, fuzzy (soft) clustering methods allow multiple cluster memberships of the clustered items. 2 This is commonly achieved by assigning to each item a weight of belonging to each cluster. 3 Thus, items on the edge of a cluster, may be in the cluster to a lesser degree than items in the center of a cluster. 4 Typically, each item has as many coefficients (weights) as there are clusters that sum up for each item to one. Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 111/121

Example: Fuzzy Clustering with fanny > library(cluster) # Loads the cluster library. > fannyy <- fanny(y, k=4, metric = "euclidean", memb.exp = 1.2) > round(fannyy$membership, 2)[1:4,] [,1] [,2] [,3] [,4] g1 0.82 0.04 0.10 0.05 g2 0.82 0.05 0.12 0.01 g3 0.98 0.01 0.01 0.01 g4 0.03 0.82 0.03 0.12 > fannyy$clustering g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g17 1 1 1 2 3 4 1 1 1 1 1 3 4 4 2 4 3 g45 g46 g47 g48 g49 g50 g51 g52 g53 g54 g55 g56 g57 g58 g59 g60 g61 2 4 2 4 1 1 3 1 3 4 1 2 3 3 3 2 2 g89 g90 g91 g92 g93 g94 g95 g96 g97 g98 g99 g100 2 3 3 2 4 3 1 2 1 4 4 4 Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 112/121

Principal Component Analysis (PCA) Principal components analysis (PCA) is a data reduction technique that allows to simplify multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis. Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 113/121

Basic PCA Steps Center (and standardize) data First principal component axis Accross centroid of data cloud Distance of each point to that line is minimized, so that it crosses the maximum variation of the data cloud Second principal component axis Orthogonal to first principal component Along maximum variation in the data 1 st PCA axis becomes x-axis and 2 nd PCA axis y-axis Continue process until the necessary number of principal components is obtained Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 114/121

PCA on Two-Dimensional Data Set 2nd 2nd 1st 1st Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 115/121

Identifies the Amount of Variability between Components Example Principal Component 1 st 2 nd 3 rd Other Proportion of Variance 62% 34% 3% rest 1 st and 2 nd principal components explain 96% of variance. Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 116/121

Example: PCA > pca <- prcomp(y, scale=t) > summary(pca) # Prints variance summary for all principal components Importance of components: PC1 PC2 PC3 PC4 PC5 Standard deviation 1.0996 1.0505 1.0247 0.9219 0.8873 Proportion of Variance 0.2418 0.2207 0.2100 0.1700 0.1575 Cumulative Proportion 0.2418 0.4626 0.6725 0.8425 1.0000 > plot(pca$x, pch=20, col="blue", type="n") # To plot dots, drop type="n" > text(pca$x, rownames(pca$x), cex=0.8) g45 PC2 2 1 0 1 2 g82 g27 g79 g48 g72 g64 g4 g13 g87 g40 g35 g84 g75 g74 g11g9 g14 g47 g61 g60 g53 g3 g92 g55 g23g36 g57 g34 g73 g10 g20 g62 g46 g70 g37 g99g83 g96 g89 g68 g16 g6g49 g18 g43 g21g8 g71 g50 g15 g63 g65 g32 g19 g2 g80 g67 g98 g100 g76 g30 g94 g69g95 g54 g29 g26 g7 g56 g77 g28 g44 g17 g42 g78 g5 g97 g25 g1 g91 g88 g66 g58 g12 g51 g81 g33 g85 g39 g52 g59 g41 g90 g38 g86 g93 g24 g22 g31 2 1 0 1 2 3 PC1 Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 117/121

Multidimensional Scaling (MDS) Alternative dimensionality reduction approach Represents distances in 2D or 3D space Starts from distance matrix (PCA uses data points) Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 118/121

Example: MDS with cmdscale The following example performs MDS analysis on the geographic distances among European cities. > loc <- cmdscale(eurodist) > plot(loc[,1], -loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)") > text(loc[,1], -loc[,2], rownames(loc), cex=0.8) cmdscale(eurodist) Stockholm 1000 0 1000 Copenhagen Hamburg Hook of Holland Calais Cherbourg Brussels Cologne Paris Lisbon Munich Lyons Madrid Geneva Vienna Marseilles Milan Barcelona Gibraltar Rome Athens 2000 1000 0 1000 2000 Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 119/121

Biclustering Finds in matrix subgroups of rows and columns which are as similar as possible to each other and as different as possible to the remaining data points. Unclustered Clustered Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 120/121

Bibliography I Yin, T., Cook, D., Lawrence, M., Aug 2012. ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol 13 (8). URL http://www.hubmed.org/display.cgi?uids=22937822 Zhang, H., Meltzer, P., Davis, S., 2013. RCircos: an R package for Circos 2D track plots. BMC Bioinformatics 14, 244 244. URL http://www.hubmed.org/display.cgi?uids=23937229 Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 121/121