Graphics and Data Visualization in R
|
|
|
- Grace Lynch
- 10 years ago
- Views:
Transcription
1 Graphics and Data Visualization in R Overview Thomas Girke December 13, 2013 Graphics and Data Visualization in R Slide 1/121
2 Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Slide 2/121
3 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Overview Slide 3/121
4 Graphics in R Powerful environment for visualizing scientific data Integrated graphics and statistics infrastructure Publication quality graphics Fully programmable Highly reproducible Full L A TEX Link & Sweave Link support Vast number of R packages with graphics utilities Graphics and Data Visualization in R Overview Slide 4/121
5 Documentation on Graphics in R General Graphics Task Page Link R Graph Gallery Link R Graphical Manual Link Paul Murrell s book R (Grid) Graphics Link Interactive graphics rggobi (GGobi) Link iplots Link Open GL (rgl) Link Graphics and Data Visualization in R Overview Slide 5/121
6 Graphics Environments Viewing and saving graphics in R On-screen graphics postscript, pdf, svg jpeg/png/wmf/tiff/... Four major graphic environments Low-level infrastructure R Base Graphics (low- and high-level) grid: Manual Link, Book Link High-level infrastructure lattice: Manual Link, Intro Link, Book Link ggplot2: Manual Link, Intro Link, Book Link Graphics and Data Visualization in R Overview Slide 6/121
7 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments Slide 7/121
8 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 8/121
9 Base Graphics: Overview Important high-level plotting functions plot: generic x-y plotting barplot: bar plots boxplot: box-and-whisker plot hist: histograms pie: pie charts dotchart: cleveland dot plots image, heatmap, contour, persp: functions to generate image-like plots qqnorm, qqline, qqplot: distribution comparison plots pairs, coplot: display of multivariant data Help on these functions?myfct?plot?par Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 9/121
10 Base Graphics: Preferred Input Data Objects Matrices and data frames Vectors Named vectors Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 10/121
11 Scatter Plot: very basic Sample data set for subsequent plots > set.seed(1410) > y <- matrix(runif(30), ncol=3, dimnames=list(letters[1:10], LETTERS[1:3])) > plot(y[,1], y[,2]) y[, 2] y[, 1] Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 11/121
12 Scatter Plot: all pairs > pairs(y) A B C Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 12/121
13 Scatter Plot: with labels > plot(y[,1], y[,2], pch=20, col="red", main="symbols and Labels") > text(y[,1]+0.03, y[,2], rownames(y)) Symbols and Labels y[, 2] j a f h b e g d i c y[, 1] Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 13/121
14 Scatter Plots: more examples Print instead of symbols the row names > plot(y[,1], y[,2], type="n", main="plot of Labels") > text(y[,1], y[,2], rownames(y)) Usage of important plotting parameters > grid(5, 5, lwd = 2) > op <- par(mar=c(8,8,8,8), bg="lightblue") > plot(y[,1], y[,2], type="p", col="red", cex.lab=1.2, cex.axis=1.2, + cex.main=1.2, cex.sub=1, lwd=4, pch=20, xlab="x label", + ylab="y label", main="my Main", sub="my Sub") > par(op) Important arguments mar: specifies the margin sizes around the plotting area in order: c(bottom, left, top, right) col: color of symbols pch: type of symbols, samples: example(points) lwd: size of symbols cex.*: control font sizes For details see?par Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 14/121
15 Scatter Plots: more examples Add a regression line to a plot > plot(y[,1], y[,2]) > myline <- lm(y[,2]~y[,1]); abline(myline, lwd=2) > summary(myline) Same plot as above, but on log scale > plot(y[,1], y[,2], log="xy") Add a mathematical expression to a plot > plot(y[,1], y[,2]); text(y[1,1], y[1,2], > expression(sum(frac(1,sqrt(x^2*pi)))), cex=1.3) Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 15/121
16 Exercise 1: Scatter Plots Task 1 Generate scatter plot for first two columns in iris data frame and color dots by its Species column. Task 2 Use the xlim/ylim arguments to set limits on the x- and y-axes so that all data points are restricted to the left bottom quadrant of the plot. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species setosa setosa setosa setosa > table(iris$species) setosa versicolor virginica Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 16/121
17 Line Plot: Single Data Set > plot(y[,1], type="l", lwd=2, col="blue") y[, 1] Index Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 17/121
18 Line Plots: Many Data Sets > split.screen(c(1,1)); [1] 1 > plot(y[,1], ylim=c(0,1), xlab="measurement", ylab="intensity", type="l", lwd=2, col=1) > for(i in 2:length(y[1,])) { + screen(1, new=false) + plot(y[,i], ylim=c(0,1), type="l", lwd=2, col=i, xaxt="n", yaxt="n", ylab="", + xlab="", main="", bty="n") + } > close.screen(all=true) Intensity Measurement Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 18/121
19 Bar Plot Basics > barplot(y[1:4,], ylim=c(0, max(y[1:4,])+0.3), beside=true, + legend=letters[1:4]) > text(labels=round(as.vector(as.matrix(y[1:4,])),2), x=seq(1.5, 13, by=1) + +sort(rep(c(0,1,2), 4)), y=as.vector(as.matrix(y[1:4,]))+0.04) A B C 0 a b c d 0.41 Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 19/121
20 Bar Plots with Error Bars > bar <- barplot(m <- rowmeans(y) * 10, ylim=c(0, 10)) > stdev <- sd(t(y)) > arrows(bar, m, bar, m + stdev, length=0.15, angle = 90) a b c d e f g h i j Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 20/121
21 Mirrored Bar Plots > df <- data.frame(group = rep(c("above", "Below"), each=10), x = rep(1:10, 2), y > plot(c(0,12),range(df$y),type = "n") > barplot(height = df$y[df$group == 'Above'], add = TRUE,axes = FALSE) > barplot(height = df$y[df$group == 'Below'], add = TRUE,axes = FALSE) range(df$y) c(0, 12) Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 21/121
22 Histograms > hist(y, freq=true, breaks=10) Histogram of y Frequency y Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 22/121
23 Density Plots > plot(density(y), col="red") density.default(x = y) Density N = 30 Bandwidth = Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 23/121
24 Pie Charts > pie(y[,1], col=rainbow(length(y[,1]), start=0.1, end=0.8), clockwise=true) > legend("topright", legend=row.names(y), cex=1.3, bty="n", pch=15, pt.cex=1.8, + col=rainbow(length(y[,1]), start=0.1, end=0.8), ncol=1) a b c d e f g h i j a b c d e f g h i j Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 24/121
25 Color Selection Utilities Default color palette and how to change it > palette() [1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow" "gray" > palette(rainbow(5, start=0.1, end=0.2)) > palette() [1] "#FF9900" "#FFBF00" "#FFE600" "#F2FF00" "#CCFF00" > palette("default") The gray function allows to select any type of gray shades by providing values from 0 to 1 > gray(seq(0.1, 1, by= 0.2)) [1] "#1A1A1A" "#4D4D4D" "#808080" "#B3B3B3" "#E6E6E6" Color gradients with colorpanel function from gplots library > library(gplots) > colorpanel(5, "darkblue", "yellow", "white") Much more on colors in R see Earl Glynn s color chart Link Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 25/121
26 Arranging Several Plots on Single Page With par(mfrow=c(nrow,ncol)) one can define how several plots are arranged next to each other. > par(mfrow=c(2,3)); for(i in 1:6) { plot(1:10) } 1: : : Index Index Index 1: : : Index Index Index Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 26/121
27 Arranging Plots with Variable Width The layout function allows to divide the plotting device into variable numbers of rows and columns with the column-widths and the row-heights specified in the respective arguments. > nf <- layout(matrix(c(1,2,3,3), 2, 2, byrow=true), c(3,7), c(5,5), + respect=true) > # layout.show(nf) > for(i in 1:3) { barplot(1:10) } Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 27/121
28 Saving Graphics to Files After the pdf() command all graphs are redirected to file test.pdf. Works for all common formats similarly: jpeg, png, ps, tiff,... > pdf("test.pdf"); plot(1:10, 1:10); dev.off() Generates Scalable Vector Graphics (SVG) files that can be edited in vector graphics programs, such as InkScape. > svg("test.svg"); plot(1:10, 1:10); dev.off() Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 28/121
29 Exercise 2: Bar Plots Task 1 Calculate the mean values for the Species components of the first four columns in the iris data set. Organize the results in a matrix where the row names are the unique values from the iris Species column and the column names are the same as in the first four iris columns. Task 2 Generate two bar plots: one with stacked bars and one with horizontally arranged bars. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species setosa setosa setosa setosa > table(iris$species) setosa versicolor virginica Graphics and Data Visualization in R Graphics Environments Base Graphics Slide 29/121
30 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments Grid Graphics Slide 30/121
31 grid Graphics Environment What is grid? Low-level graphics system Highly flexible and controllable system Does not provide high-level functions Intended as development environment for custom plotting functions Pre-installed on new R distributions Documentation and Help Manual Link Book Link Graphics and Data Visualization in R Graphics Environments Grid Graphics Slide 31/121
32 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments lattice Slide 32/121
33 lattice Environment What is lattice? High-level graphics system Developed by Deepayan Sarkar Implements Trellis graphics system from S-Plus Simplifies high-level plotting tasks: arranging complex graphical features Syntax similar to R s base graphics Documentation and Help Manual Link Intro Link Book Link library(help=lattice) opens a list of all functions available in the lattice package Accessing and changing global parameters:?lattice.options and?trellis.device Graphics and Data Visualization in R Graphics Environments lattice Slide 33/121
34 Scatter Plot Sample > library(lattice) > p1 <- xyplot(1:8 ~ 1:8 rep(letters[1:4], each=2), as.table=true) > plot(p1) A B :8 C D :8 Graphics and Data Visualization in R Graphics Environments lattice Slide 34/121
35 Line Plot Sample > library(lattice) > p2 <- parallelplot(~iris[1:4] Species, iris, horizontal.axis = FALSE, + layout = c(1, 3, 1)) > plot(p2) Max virginica Min versicolor Max Min Max setosa Min Sepal.Length Sepal.Width Petal.Length Petal.Width Graphics and Data Visualization in R Graphics Environments lattice Slide 35/121
36 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 36/121
37 ggplot2 Environment What is ggplot2? High-level graphics system Implements grammar of graphics from Leland Wilkinson Link Streamlines many graphics workflows for complex plots Syntax centered around main ggplot function Simpler qplot function provides many shortcuts Documentation and Help Manual Link Intro Link Book Link Cookbook for R Link Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 37/121
38 ggplot2 Usage ggplot function accepts two arguments Data set to be plotted Aesthetic mappings provided by aes function Additional parameters such as geometric objects (e.g. points, lines, bars) are passed on by appending them with + as separator. List of available geom_* functions: Settings of plotting theme can be accessed with the command theme_get() and its settings can be changed with theme(). Preferred input data object qgplot: data.frame (support for vector, matrix,...) ggplot: data.frame Packages with convenience utilities to create expected inputs plyr reshape Link Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 38/121
39 qplot Function qplot syntax is similar to R s basic plot function Arguments: x: x-coordinates (e.g. col1) y: y-coordinates (e.g. col2) data: data frame with corresponding column names xlim, ylim: e.g. xlim=c(0,10) log: e.g. log="x" or log="xy" main: main title; see?plotmath for mathematical formula xlab, ylab: labels for the x- and y-axes color, shape, size...: many arguments accepted by plot function Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 39/121
40 qplot: Scatter Plots Create sample data > library(ggplot2) > x <- sample(1:10, 10); y <- sample(1:10, 10); cat <- rep(c("a", "B"), 5) Simple scatter plot > qplot(x, y, geom="point") Prints dots with different sizes and colors > qplot(x, y, geom="point", size=x, color=cat, + main="dot Size and Color Relative to Some Values") Drops legend > qplot(x, y, geom="point", size=x, color=cat) + + theme(legend.position = "none") Plot different shapes > qplot(x, y, geom="point", size=5, shape=cat) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 40/121
41 qplot: Scatter Plot with qplot > p <- qplot(x, y, geom="point", size=x, color=cat, + main="dot Size and Color Relative to Some Values") + + theme(legend.position = "none") > print(p) Dot Size and Color Relative to Some Values y x Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 41/121
42 qplot: Scatter Plot with Regression Line > set.seed(1410) > dsmall <- diamonds[sample(nrow(diamonds), 1000), ] > p <- qplot(carat, price, data = dsmall, geom = c("point", "smooth"), + method = "lm") > print(p) carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 42/121
43 qplot: Scatter Plot with Local Regression Curve (loess) > p <- qplot(carat, price, data=dsmall, geom=c("point", "smooth"), span=0.4) > print(p) # Setting 'se=false' removes error shade carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 43/121
44 ggplot Function More important than qplot to access full functionality of ggplot2 Main arguments data set, usually a data.frame aesthetic mappings provided by aes function General ggplot syntax ggplot(data, aes(...)) + geom_*() stat_*() +... Layer specifications geom_*(mapping, data,..., geom, position) stat_*(mapping, data,..., stat, position) Additional components scales coordinates facet aes() mappings can be passed on to all components (ggplot, geom_*, etc.). Effects are global when passed on to ggplot() and local for other components. x, y color: grouping vector (factor) group: grouping vector (factor) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 44/121
45 Changing Plotting Themes with ggplot Theme settings can be accessed with theme_get() Their settings can be changed with theme() Some examples Change background color to white... + theme(panel.background=element_rect(fill = "white", colour = "black")) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 45/121
46 Storing ggplot Specifications Plots and layers can be stored in variables > p <- ggplot(dsmall, aes(carat, price)) + geom_point() > p # or print(p) Returns information about data and aesthetic mappings followed by each layer > summary(p) Prints dots with different sizes and colors > bestfit <- geom_smooth(methodw = "lm", se = F, color = alpha("steelblue", 0.5), > p + bestfit # Plot with custom regression line Syntax to pass on other data sets > p %+% diamonds[sample(nrow(diamonds), 100),] Saves plot stored in variable p to file > ggsave(p, file="myplot.pdf") Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 46/121
47 ggplot: Scatter Plot > p <- ggplot(dsmall, aes(carat, price, color=color)) + + geom_point(size=4) > print(p) carat price color D E F G H I J Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 47/121
48 ggplot: Scatter Plot with Regression Line > p <- ggplot(dsmall, aes(carat, price)) + geom_point() + + geom_smooth(method="lm", se=false) + + theme(panel.background=element_rect(fill = "white", colour = "black > print(p) carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 48/121
49 ggplot: Scatter Plot with Several Regression Lines > p <- ggplot(dsmall, aes(carat, price, group=color)) + + geom_point(aes(color=color), size=2) + + geom_smooth(aes(color=color), method = "lm", se=false) > print(p) carat price color D E F G H I J Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 49/121
50 ggplot: Scatter Plot with Local Regression Curve (loess) > p <- ggplot(dsmall, aes(carat, price)) + geom_point() + geom_smooth() > print(p) # Setting 'se=false' removes error shade carat price Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 50/121
51 ggplot: Line Plot > p <- ggplot(iris, aes(petal.length, Petal.Width, group=species, + color=species)) + geom_line() > print(p) Petal.Width 1.5 Species setosa versicolor virginica Petal.Length Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 51/121
52 ggplot: Faceting > p <- ggplot(iris, aes(sepal.length, Sepal.Width)) + + geom_line(aes(color=species), size=1) + + facet_wrap(~species, ncol=1) > print(p) 4.5 setosa versicolor Sepal.Width Species setosa versicolor virginica virginica Sepal.Length Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 52/121
53 Exercise 3: Scatter Plots Task 1 Generate scatter plot for first two columns in iris data frame and color dots by its Species column. Task 2 Use the xlim, ylim functionss to set limits on the x- and y-axes so that all data points are restricted to the left bottom quadrant of the plot. Task 3 Generate corresponding line plot with faceting show individual data sets in saparate plots. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species setosa setosa setosa setosa > table(iris$species) setosa versicolor virginica Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 53/121
54 ggplot: Bar Plots Sample Set: the following transforms the iris data set into a ggplot2-friendly format. Calculate mean values for aggregates given by Species column in iris data set > iris_mean <- aggregate(iris[,1:4], by=list(species=iris$species), FUN=mean) Calculate standard deviations for aggregates given by Species column in iris data set > iris_sd <- aggregate(iris[,1:4], by=list(species=iris$species), FUN=sd) Convert iris_mean with melt > library(reshape2) # Defines melt function > df_mean <- melt(iris_mean, id.vars=c("species"), variable.name = "Samples", value.n Convert iris_sd with melt > df_sd <- melt(iris_sd, id.vars=c("species"), variable.name = "Samples", value.name= Define standard deviation limits > limits <- aes(ymax = df_mean[,"values"] + df_sd[,"values"], ymin=df_mean[,"values"] Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 54/121
55 ggplot: Bar Plot > p <- ggplot(df_mean, aes(samples, Values, fill = Species)) + + geom_bar(position="dodge", stat="identity") > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 55/121
56 ggplot: Bar Plot Sideways > p <- ggplot(df_mean, aes(samples, Values, fill = Species)) + + geom_bar(position="dodge", stat="identity") + coord_flip() + + theme(axis.text.y=theme_text(angle=0, hjust=1)) > print(p) Petal.Width Petal.Length Samples Species setosa versicolor virginica Sepal.Width Sepal.Length Values Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 56/121
57 ggplot: Bar Plot with Faceting > p <- ggplot(df_mean, aes(samples, Values)) + geom_bar(aes(fill = Species), stat + facet_wrap(~species, ncol=1) > print(p) 6 setosa versicolor Values Species setosa versicolor virginica 0 virginica Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 57/121
58 ggplot: Bar Plot with Error Bars > p <- ggplot(df_mean, aes(samples, Values, fill = Species)) + + geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 58/121
59 ggplot: Changing Color Settings > library(rcolorbrewer) > # display.brewer.all() > p <- ggplot(df_mean, aes(samples, Values, fill=species, color=species)) + + geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, position="dodge") + + scale_fill_brewer(palette="blues") + scale_color_brewer(palette = "Greys") > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 59/121
60 ggplot: Using Standard Colors > p <- ggplot(df_mean, aes(samples, Values, fill=species, color=species)) + + geom_bar(position="dodge", stat="identity") + geom_errorbar(limits, position="dodge") + + scale_fill_manual(values=c("red", "green3", "blue")) + + scale_color_manual(values=c("red", "green3", "blue")) > print(p) 6 Values 4 Species setosa versicolor virginica 2 0 Sepal.Length Sepal.Width Petal.Length Petal.Width Samples Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 60/121
61 ggplot: Mirrored Bar Plots > df <- data.frame(group = rep(c("above", "Below"), each=10), x = rep(1:10, 2), y = c(runif(10, 0, 1), runi > p <- ggplot(df, aes(x=x, y=y, fill=group)) + + geom_bar(stat="identity", position="identity") > print(p) 0.5 y 0.0 group Above Below x Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 61/121
62 Exercise 4: Bar Plots Task 1 Calculate the mean values for the Species components of the first four columns in the iris data set. Use the melt function from the reshape2 package to bring the results into the expected format for ggplot. Task 2 Generate two bar plots: one with stacked bars and one with horizontally arranged bars. Structure of iris data set: > class(iris) [1] "data.frame" > iris[1:4,] Sepal.Length Sepal.Width Petal.Length Petal.Width Species setosa setosa setosa setosa > table(iris$species) setosa versicolor virginica Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 62/121
63 ggplot: Data Reformatting Example for Line Plot > y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("sample", 1:5, sep=""))) > y <- data.frame(position=1:length(y[,1]), y) > y[1:4, ] # First rows of input format expected by melt() Position Sample1 Sample2 Sample3 Sample4 Sample5 g g g g > df <- melt(y, id.vars=c("position"), variable.name = "Samples", value.name="values") > p <- ggplot(df, aes(position, Values)) + geom_line(aes(color=samples)) + facet_wrap(~samples, ncol=1) > print(p) > ## Represent same data in box plot > ## ggplot(df, aes(samples, Values, fill=samples)) + geom_boxplot() Sample1 Values Sample2 Sample3 Sample4 Sample5 Samples Sample1 Sample2 Sample3 Sample4 Sample Graphics and Data Visualization in R 4 Graphics Environments ggplot2 Slide 63/121
64 ggplot: Jitter Plots > p <- ggplot(dsmall, aes(color, price/carat)) + + geom_jitter(alpha = I(1 / 2), aes(color=color)) > print(p) price/carat 5000 color D E F G H I J D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 64/121
65 ggplot: Box Plots > p <- ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot() > print(p) price/carat color D E F G H I 5000 J D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 65/121
66 ggplot: Density Plot with Line Coloring > p <- ggplot(dsmall, aes(carat)) + geom_density(aes(color = color)) > print(p) 1.5 density color D E F G H I J carat Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 66/121
67 ggplot: Density Plot with Area Coloring > p <- ggplot(dsmall, aes(carat)) + geom_density(aes(fill = color)) > print(p) 1.5 density color D E F G H I J carat Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 67/121
68 ggplot: Histograms > p <- ggplot(iris, aes(x=sepal.width)) + geom_histogram(aes(y =..density.., + fill =..count..), binwidth=0.2) + geom_density() > print(p) count density Sepal.Width Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 68/121
69 ggplot: Pie Chart > df <- data.frame(variable=rep(c("cat", "mouse", "dog", "bird", "fly")), + value=c(1,3,3,4,2)) > p <- ggplot(df, aes(x = "", y = value, fill = variable)) + + geom_bar(width = 1, stat="identity") + + coord_polar("y", start=pi / 3) + ggtitle("pie Chart") > print(p) Pie Chart "" 7.5 variable bird cat dog fly mouse value Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 69/121
70 ggplot: Wind Rose Pie Chart > p <- ggplot(df, aes(x = variable, y = value, fill = variable)) + + geom_bar(width = 1, stat="identity") + coord_polar("y", start=pi / 3) + + ggtitle("pie Chart") > print(p) Pie Chart mouse 3 variable fly dog cat bird 0/4 variable bird cat dog fly mouse 2 1 value Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 70/121
71 ggplot: Arranging Graphics on One Page > library(grid) > a <- ggplot(dsmall, aes(color, price/carat)) + geom_jitter(size=4, alpha = I(1 / 1.5), aes(color=color)) > b <- ggplot(dsmall, aes(color, price/carat, color=color)) + geom_boxplot() > c <- ggplot(dsmall, aes(color, price/carat, fill=color)) + geom_boxplot() + theme(legend.position = "none > grid.newpage() # Open a new page on grid device > pushviewport(viewport(layout = grid.layout(2, 2))) # Assign to device viewport with 2 by 2 grid layout > print(a, vp = viewport(layout.pos.row = 1, layout.pos.col = 1:2)) > print(b, vp = viewport(layout.pos.row = 2, layout.pos.col = 1)) > print(c, vp = viewport(layout.pos.row = 2, layout.pos.col = 2, width=0.3, height=0.3, x=0.8, y=0.8)) Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 71/121
72 ggplot: Arranging Graphics on One Page price/carat color D E F G H I J D E F G H I J color color price/carat D E F G H I price/carat J 2000 D E F G H I J color D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 72/121
73 ggplot: Inserting Graphics into Plots > # pdf("insert.pdf") > print(a) > print(b, vp=viewport(width=0.3, height=0.3, x=0.8, y=0.8)) > # dev.off() price/carat color D 8000 E 6000 F G 4000 H 2000 I DEFGHIJ J color price/carat 6000 color D E F G H I J D E F G H I J color Graphics and Data Visualization in R Graphics Environments ggplot2 Slide 73/121
74 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Specialty Graphics Slide 74/121
75 Venn Diagrams (Code) > source(" > setlist5 <- list(a=sample(letters, 18), B=sample(letters, 16), C=sample(letters, 20), D=sample(letters, 2 > OLlist5 <- overlapper(setlist=setlist5, sep="_", type="vennsets") > counts <- sapply(ollist5$venn_list, length) > # pdf("venn.pdf") > vennplot(counts=counts, ccol=c(rep(1,30),2), lcex=1.5, ccex=c(rep(1.5,5), rep(0.6,25),1.5)) > # dev.off() Graphics and Data Visualization in R Specialty Graphics Slide 75/121
76 Venn Diagram (Plot) Venn Diagram Unique objects: All = 26; S1 = 18; S2 = 16; S3 = 20; S4 = 22; S5 = A B C D E Figure: Venn Diagram Graphics and Data Visualization in R Specialty Graphics Slide 76/121
77 Compound Depictions with ChemmineR > library(chemminer) > data(sdfsample) > plot(sdfsample[1], print=false) CMP1 N O H N O O N O O O N H Graphics and Data Visualization in R Specialty Graphics Slide 77/121
78 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Genome Graphics Slide 78/121
79 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Genome Graphics ggbio Slide 79/121
80 ggbio: A Programmable Genome Browser A genome browser is a visulalization tool for plotting different types of genomic data in separate tracks along chromosomes. The ggbio package (Yin et al., 2012) facilitates plotting of complex genome data objects, such as read alignments (SAM/BAM), genomic context/annotation information (gff/txdb), variant calls (VCF/BCF), and more. To easily compare these data sets, it extends the faceting facility of ggplot2 to genome browser-like tracks. Most of the core object types for handling genomic data with R/Bioconductor are supported: GRanges, GAlignments, VCF, etc. For more details, see Table 1.1 of the ggbio vignette Link. ggbio s convenience plotting function is autoplot. For more customizable plots, one can use the generic ggplot function. Apart from the standard ggplot2 plotting components, ggbio defines serval new components useful for genomic data visualization. A detailed list is given in Table 1.2 of the vignette Link. Useful web sites: ggbio manual Link ggbio functions Link autoplot demo Link Graphics and Data Visualization in R Genome Graphics ggbio Slide 80/121
81 Graphics and Data Visualization in R Genome Graphics ggbio Slide 81/121 Tracks: Aligning Plots Along Chromosomes > library(ggbio) > df1 <- data.frame(time = 1:100, score = sin((1:100)/20)*10) > p1 <- qplot(data = df1, x = time, y = score, geom = "line") > df2 <- data.frame(time = 30:120, score = sin((30:120)/20)*10, value = rnorm( )) > p2 <- ggplot(data = df2, aes(x = time, y = score)) + geom_line() + geom_point(size = 2, aes(color = value > tracks(time1 = p1, time2 = p2) + xlim(1, 40) + theme_tracks_sunset() 10 5 time1 score value 2 time2 score
82 Plotting Genomic Ranges GRanges objects are essential for storing alignment or annotation ranges in R/Bioconductor. The following creates a sample GRanges object and plots its content. > library(genomicranges) > set.seed(1); N <- 100; gr <- GRanges(seqnames = sample(c("chr1", "chr2", "chr3"), size = N, replace = TRU > autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames) chr1 chr2 chr3 + strand + 0 bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp Graphics and Data Visualization in R Genome Graphics ggbio Slide 82/121
83 Plotting Coverage Instead of Ranges > autoplot(gr, aes(color = strand, fill = strand), facets = strand ~ seqnames, stat = "coverage") chr1 chr2 chr3 7.5 Coverage strand bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp 0 bp100 bp200 bp300 bp Graphics and Data Visualization in R Genome Graphics ggbio Slide 83/121
84 Mirrored Coverage Plot > pos <- sapply(coverage(gr[strand(gr)=="+"]), as.numeric) > pos <- data.frame(chr=rep(names(pos), sapply(pos, length)), Strand=rep("+", length(unlist(pos))), Positio > neg <- sapply(coverage(gr[strand(gr)=="-"]), as.numeric) > neg <- data.frame(chr=rep(names(neg), sapply(neg, length)), Strand=rep("-", length(unlist(neg))), Positio > covdf <- rbind(pos, neg) > p <- ggplot(covdf, aes(position, Coverage, fill=strand)) + + geom_bar(stat="identity", position="identity") + facet_wrap(~chr) > p chr1 chr2 chr3 5 Coverage 0 Strand Position Graphics and Data Visualization in R Genome Graphics ggbio Slide 84/121
85 Circular Layout > autoplot(gr, layout = "circle", aes(fill = seqnames)) seqnames chr1 chr2 chr3 Graphics and Data Visualization in R Genome Graphics ggbio Slide 85/121
86 More Complex Circular Example > seqlengths(gr) <- c(400, 500, 700) > values(gr)$to.gr <- gr[sample(1:length(gr), size = length(gr))] > idx <- sample(1:length(gr), size = 50) > gr <- gr[idx] > ggplot() + layout_circle(gr, geom = "ideo", fill = "gray70", radius = 7, trackwidth = 3) + + layout_circle(gr, geom = "bar", radius = 10, trackwidth = 4, + aes(fill = score, y = score)) + + layout_circle(gr, geom = "point", color = "red", radius = 14, + trackwidth = 3, grid = TRUE, aes(y = score)) + + layout_circle(gr, geom = "link", linked.to = "to.gr", radius = 6, trackwidth = 1) score Graphics and Data Visualization in R Genome Graphics ggbio Slide 86/121
87 Viewing Alignments and Variants To make the following example work, please download and unpack this data archive Link containing GFF, BAM and VCF sample files. > library(rtracklayer); library(genomicfeatures); library(rsamtools); library(variantannotation) > ga <- readgalignmentsfrombam("./data/srr fastq.bam", use.names=true, param=scanbamparam(which=grang > p1 <- autoplot(ga, geom = "rect") > p2 <- autoplot(ga, geom = "line", stat = "coverage") > vcf <- readvcf(file="data/varianttools_gnsap.vcf", genome="ath1") > p3 <- autoplot(vcf[seqnames(vcf)=="chr5"], type = "fixed") + xlim(4000, 8000) + theme(legend.position = " > txdb <- maketranscriptdbfromgff(file="./data/tair10_gff3_trunc.gff", format="gff3") > p4 <- autoplot(txdb, which=granges("chr5", IRanges(4000, 8000)), names.expr = "gene_id") > tracks(reads=p1, Coverage=p2, Variant=p3, Transcripts=p4, heights = c(0.3, 0.2, 0.1, 0.35)) + ylab("") Reads Coverage Variant C T ALTREF Transcripts AT5G01015 AT5G01015 AT5G01010 AT5G01010 AT5G01010 AT5G01010 AT5G kb 4 kb 5 kb 6 kb 7 kb 8 kb Genome Graphics Graphics and Data Visualization in R ggbio Slide 87/121
88 Additional Sample Plots autoplot demo Link Graphics and Data Visualization in R Genome Graphics ggbio Slide 88/121
89 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Genome Graphics Additional Genome Graphics Slide 89/121
90 Additional Packages for Visualizing Genome Data Gviz Link RCircos (Zhang et al., 2013) Genome Graphs genoplotr Link Link Link Graphics and Data Visualization in R Genome Graphics Additional Genome Graphics Slide 90/121
91 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Slide 91/121
92 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Background Slide 92/121
93 What is Clustering? Clustering is the classification/partitioning of data objects into similarity groups (clusters) according to a defined distance measure. It is used in many fields, such as machine learning, data mining, pattern recognition, image analysis, genomics, systems biology, etc. Machine learning typically regards data clustering as a form of unsupervised learning. Graphics and Data Visualization in R Clustering Background Slide 93/121
94 Why Clustering and Data Mining in R? Efficient data structures and functions for clustering. Efficient environment for algorithm prototyping and benchmarking. Comprehensive set of clustering and machine learning libraries. Standard for data analysis in many areas. Overview: Clustering Task View on CRAN Graphics and Data Visualization in R Clustering Background Slide 94/121
95 Data Transformations Choice depends on data set! Center & standardize 1 Center: subtract vector mean from each value 2 Standardize: devide by standard deviation Mean = 0 and STDEV = 1 Center & scale with the scale() fuction 1 Center: subtract vector mean from each value 2 Scale: divide centered vector by their root mean square (rms) Log transformation x rms = 1 n 1 n i=1 Mean = 0 and STDEV = 1 Rank transformation: replace measured values by ranks No transformation x i 2 Graphics and Data Visualization in R Clustering Background Slide 95/121
96 Distance Methods List of most common ones! Euclidean distance for two profiles X and Y d(x, Y ) = n (x i y i ) 2 Disadvantages: not scale invariant, not for negative correlations i=1 Maximum, Manhattan, Canberra, binary, Minowski,... Correlation-based distance: 1 r Pearson correlation coefficient (PCC) r = n n i=1 x iy i n i=1 x n i i=1 y i ( n i=1 x i 2 ( n i=1 x i) 2 )( n i=1 y i 2 ( n i=1 y i) 2 ) Disadvantage: outlier sensitive Spearman correlation coefficient (SCC) Same calculation as PCC but with ranked values! Graphics and Data Visualization in R Clustering Background Slide 96/121
97 There Are Many more Distance Measures If the distances among items are quantifiable, then clustering is possible. Choose the most accurate and meaningful distance measure for a given field of application. If uncertain then choose several distance measures and compare the results. Graphics and Data Visualization in R Clustering Background Slide 97/121
98 Cluster Linkage Single Linkage Complete Linkage Average Linkage Graphics and Data Visualization in R Clustering Background Slide 98/121
99 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 99/121
100 Hierarchical Clustering Steps 1 Identify clusters (items) with closest distance 2 Join them to new clusters 3 Compute distance between clusters (items) 4 Return to step 1 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 100/121
101 Hierarchical Clustering Agglomerative Approach (a) 0.1 g1 g2 g3 g4 g5 (b) g1 g2 g3 g4 g5 (c) g1 g2 g3 g4 g5 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 101/121
102 Hierarchical Clustering Result with Heatmap Count Color Key and Histogram Row Z Score g10 g7 g9 g1 g5 g2 g6 g8 g4 g3 t4 t5 t2 t1 t3 A heatmap is a color coded table. To visually identify patterns, the rows and columns of a heatmap are often sorted by hierarchical clustering trees. In case of gene expression data, the row tree usually represents the genes, the column tree the treatments and the colors in the heat table represent the intensities or ratios of the underlying gene expression data set. Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 102/121
103 Hierarchical Clustering Approaches in R 1 Agglomerative approach (bottom-up) hclust() and agnes() 2 Divisive approach (top-down) diana() Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 103/121
104 Tree Cutting to Obtain Discrete Clusters 1 Node height in tree 2 Number of clusters 3 Search tree nodes by distance cutoff Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 104/121
105 Example: hclust and heatmap.2 > library(gplots) > y <- matrix(rnorm(500), 100, 5, dimnames=list(paste("g", 1:100, sep=""), paste("t", 1:5, sep=""))) > heatmap.2(y) # Shortcut to final result Color Key and Histogram Value g39 g81 g31 g22 g66 g41 g5 g77 g58 g19 g96 g32 g71 g20 g60 g56 g89 g92 g73 g47 g61 g38 g12 g86 g33 g25 g42 g17 g28 g90 g51 g91 g34 g69 g16 g94 g46 g75 g36 g37 g68 g57 g53 g62 g59 g85 g100 g88 g26 g70 g24 g93 g67 g15 g30 g44 g54 g23 g83 g65 g21 g8 g52 g1 g78 g29 g63 g49 g95 g55 g97 g18 g80 g50 g7 g2 g76 g10 g11 g3 g40 g6 g99 g98 g43 g79 g4 g64 g84 g74 g35 g27 g82 g48 g14 g87 g45 g13 g72 t1 t2 t3 t5 t4 Count 0 40 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 105/121
106 Example: Stepwise Approach with Tree Cutting > ## Row- and column-wise clustering > hr <- hclust(as.dist(1-cor(t(y), method="pearson")), method="complete") > hc <- hclust(as.dist(1-cor(y, method="spearman")), method="complete") > ## Tree cutting > mycl <- cutree(hr, h=max(hr$height)/1.5); mycolhc <- rainbow(length(unique(mycl)), start=0.1, end=0.9); mycolhc > ## Plot heatmap > mycol <- colorpanel(40, "darkblue", "yellow", "white") # or try redgreen(75) > heatmap.2(y, Rowv=as.dendrogram(hr), Colv=as.dendrogram(hc), col=mycol, scale="row", density.info="none", trace Color Key Row Z Score g79 g64 g84 g20 g60 g4 g45 g82 g27 g35 g74 g14 g98 g6 g99 g8 g21 g93 g65 g24 g48 g100 g63 g87 g49 g85 g29 g59 g80 g88 g26 g70 g30 g11 g67 g52 g9 g95 g40 g1 g78 g3 g55 g61 g47 g73 g15 g81 g32 g71 g7 g41 g2 g22 g10 g76 g97 g18 g50 g53 g19 g96 g5 g66 g77 g25 g17 g42 g31 g58 g12 g38 g62 g28 g33 g68 g86 g46 g72 g92 g56 g89 g44 g83 g13 g54 g23 g39 g34 g94 g75 g43 g16 g69 g36 g37 g91 g57 g51 g90 t1 t2 t5 t3 t4 Graphics and Data Visualization in R Clustering Hierarchical Clustering Example Slide 106/121
107 Outline Overview Graphics Environments Base Graphics Grid Graphics lattice ggplot2 Specialty Graphics Genome Graphics ggbio Additional Genome Graphics Clustering Background Hierarchical Clustering Example Non-Hierarchical Clustering Examples Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 107/121
108 K-Means Clustering 1 Choose the number of k clusters 2 Randomly assign items to the k clusters 3 Calculate new centroid for each of the k clusters 4 Calculate the distance of all items to the k centroids 5 Assign items to closest centroid 6 Repeat until clusters assignments are stable Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 108/121
109 K-Means (a) X X X (b) X X X (c) X X X Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 109/121
110 Example: Clustering with kmeans Function > km <- kmeans(t(scale(t(y))), 3) > km$cluster g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g g45 g46 g47 g48 g49 g50 g51 g52 g53 g54 g55 g56 g57 g58 g59 g60 g g89 g90 g91 g92 g93 g94 g95 g96 g97 g98 g99 g Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 110/121
111 Fuzzy C-Means Clustering 1 In contrast to strict (hard) clustering approaches, fuzzy (soft) clustering methods allow multiple cluster memberships of the clustered items. 2 This is commonly achieved by assigning to each item a weight of belonging to each cluster. 3 Thus, items on the edge of a cluster, may be in the cluster to a lesser degree than items in the center of a cluster. 4 Typically, each item has as many coefficients (weights) as there are clusters that sum up for each item to one. Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 111/121
112 Example: Fuzzy Clustering with fanny > library(cluster) # Loads the cluster library. > fannyy <- fanny(y, k=4, metric = "euclidean", memb.exp = 1.2) > round(fannyy$membership, 2)[1:4,] [,1] [,2] [,3] [,4] g g g g > fannyy$clustering g1 g2 g3 g4 g5 g6 g7 g8 g9 g10 g11 g12 g13 g14 g15 g16 g g45 g46 g47 g48 g49 g50 g51 g52 g53 g54 g55 g56 g57 g58 g59 g60 g g89 g90 g91 g92 g93 g94 g95 g96 g97 g98 g99 g Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 112/121
113 Principal Component Analysis (PCA) Principal components analysis (PCA) is a data reduction technique that allows to simplify multidimensional data sets to 2 or 3 dimensions for plotting purposes and visual variance analysis. Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 113/121
114 Basic PCA Steps Center (and standardize) data First principal component axis Accross centroid of data cloud Distance of each point to that line is minimized, so that it crosses the maximum variation of the data cloud Second principal component axis Orthogonal to first principal component Along maximum variation in the data 1 st PCA axis becomes x-axis and 2 nd PCA axis y-axis Continue process until the necessary number of principal components is obtained Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 114/121
115 PCA on Two-Dimensional Data Set 2nd 2nd 1st 1st Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 115/121
116 Identifies the Amount of Variability between Components Example Principal Component 1 st 2 nd 3 rd Other Proportion of Variance 62% 34% 3% rest 1 st and 2 nd principal components explain 96% of variance. Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 116/121
117 Example: PCA > pca <- prcomp(y, scale=t) > summary(pca) # Prints variance summary for all principal components Importance of components: PC1 PC2 PC3 PC4 PC5 Standard deviation Proportion of Variance Cumulative Proportion > plot(pca$x, pch=20, col="blue", type="n") # To plot dots, drop type="n" > text(pca$x, rownames(pca$x), cex=0.8) g45 PC g82 g27 g79 g48 g72 g64 g4 g13 g87 g40 g35 g84 g75 g74 g11g9 g14 g47 g61 g60 g53 g3 g92 g55 g23g36 g57 g34 g73 g10 g20 g62 g46 g70 g37 g99g83 g96 g89 g68 g16 g6g49 g18 g43 g21g8 g71 g50 g15 g63 g65 g32 g19 g2 g80 g67 g98 g100 g76 g30 g94 g69g95 g54 g29 g26 g7 g56 g77 g28 g44 g17 g42 g78 g5 g97 g25 g1 g91 g88 g66 g58 g12 g51 g81 g33 g85 g39 g52 g59 g41 g90 g38 g86 g93 g24 g22 g PC1 Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 117/121
118 Multidimensional Scaling (MDS) Alternative dimensionality reduction approach Represents distances in 2D or 3D space Starts from distance matrix (PCA uses data points) Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 118/121
119 Example: MDS with cmdscale The following example performs MDS analysis on the geographic distances among European cities. > loc <- cmdscale(eurodist) > plot(loc[,1], -loc[,2], type="n", xlab="", ylab="", main="cmdscale(eurodist)") > text(loc[,1], -loc[,2], rownames(loc), cex=0.8) cmdscale(eurodist) Stockholm Copenhagen Hamburg Hook of Holland Calais Cherbourg Brussels Cologne Paris Lisbon Munich Lyons Madrid Geneva Vienna Marseilles Milan Barcelona Gibraltar Rome Athens Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 119/121
120 Biclustering Finds in matrix subgroups of rows and columns which are as similar as possible to each other and as different as possible to the remaining data points. Unclustered Clustered Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 120/121
121 Bibliography I Yin, T., Cook, D., Lawrence, M., Aug ggbio: an R package for extending the grammar of graphics for genomic data. Genome Biol 13 (8). URL Zhang, H., Meltzer, P., Davis, S., RCircos: an R package for Circos 2D track plots. BMC Bioinformatics 14, URL Graphics and Data Visualization in R Clustering Non-Hierarchical Clustering Examples Slide 121/121
Getting started with qplot
Chapter 2 Getting started with qplot 2.1 Introduction In this chapter, you will learn to make a wide variety of plots with your first ggplot2 function, qplot(), short for quick plot. qplot makes it easy
R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol
R Graphics Cookbook Winston Chang Beijing Cambridge Farnham Koln Sebastopol O'REILLY Tokyo Table of Contents Preface ix 1. R Basics 1 1.1. Installing a Package 1 1.2. Loading a Package 2 1.3. Loading a
Data Visualization in R
Data Visualization in R L. Torgo [email protected] Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto Oct, 2014 Introduction Motivation for Data Visualization Humans are outstanding at detecting
Viewing Ecological data using R graphics
Biostatistics Illustrations in Viewing Ecological data using R graphics A.B. Dufour & N. Pettorelli April 9, 2009 Presentation of the principal graphics dealing with discrete or continuous variables. Course
R Graphics II: Graphics for Exploratory Data Analysis
UCLA Department of Statistics Statistical Consulting Center Irina Kukuyeva [email protected] April 26, 2010 Outline 1 Summary Plots 2 Time Series Plots 3 Geographical Plots 4 3D Plots 5 Simulation
Getting Started with R and RStudio 1
Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following
5 Correlation and Data Exploration
5 Correlation and Data Exploration Correlation In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both
Graphics in R. Biostatistics 615/815
Graphics in R Biostatistics 615/815 Last Lecture Introduction to R Programming Controlling Loops Defining your own functions Today Introduction to Graphics in R Examples of commonly used graphics functions
Scientific data visualization
Scientific data visualization Using ggplot2 Sacha Epskamp University of Amsterdam Department of Psychological Methods 11-04-2014 Hadley Wickham Hadley Wickham Evolution of data visualization Scientific
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize
Data Visualization with R Language
1 Data Visualization with R Language DENG, Xiaodong ([email protected] ) Research Assistant Saw Swee Hock School of Public Health, National University of Singapore Why Visualize Data? For better
Each function call carries out a single task associated with drawing the graph.
Chapter 3 Graphics with R 3.1 Low-Level Graphics R has extensive facilities for producing graphs. There are both low- and high-level graphics facilities. The low-level graphics facilities provide basic
How To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
Graphical Representation of Multivariate Data
Graphical Representation of Multivariate Data One difficulty with multivariate data is their visualization, in particular when p > 3. At the very least, we can construct pairwise scatter plots of variables.
Cluster Analysis using R
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other
Hierarchical Clustering Analysis
Hierarchical Clustering Analysis What is Hierarchical Clustering? Hierarchical clustering is used to group similar objects into clusters. In the beginning, each row and/or column is considered a cluster.
Chapter 4 Creating Charts and Graphs
Calc Guide Chapter 4 OpenOffice.org Copyright This document is Copyright 2006 by its contributors as listed in the section titled Authors. You can distribute it and/or modify it under the terms of either
Tutorial 2: Descriptive Statistics and Exploratory Data Analysis
Tutorial 2: Descriptive Statistics and Exploratory Data Analysis Rob Nicholls [email protected] MRC LMB Statistics Course 2014 A very basic understanding of the R software environment is assumed.
Big Data and Scripting. Plotting in R
1, Big Data and Scripting Plotting in R 2, the art of plotting: first steps fundament of plotting in R: plot(x,y) plot some random values: plot(runif(10)) values are interpreted as y-values, x-values filled
TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:
Table of contents: Access Data for Analysis Data file types Format assumptions Data from Excel Information links Add multiple data tables Create & Interpret Visualizations Table Pie Chart Cross Table Treemap
Lecture 2: Exploratory Data Analysis with R
Lecture 2: Exploratory Data Analysis with R Last Time: 1. Introduction: Why use R? / Syllabus 2. R as calculator 3. Import/Export of datasets 4. Data structures 5. Getting help, adding packages 6. Homework
Medical Information Management & Mining. You Chen Jan,15, 2013 [email protected]
Medical Information Management & Mining You Chen Jan,15, 2013 [email protected] 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode
Iris Sample Data Set Basic Visualization Techniques: Charts, Graphs and Maps CS598 Information Visualization Spring 2010 Many of the exploratory data techniques are illustrated with the Iris Plant data
Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler
Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?
Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 8/05/2005 1 What is data exploration? A preliminary
Visualization Quick Guide
Visualization Quick Guide A best practice guide to help you find the right visualization for your data WHAT IS DOMO? Domo is a new form of business intelligence (BI) unlike anything before an executive
Data Exploration Data Visualization
Data Exploration Data Visualization What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping to select
COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3
COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping
Plotting: Customizing the Graph
Plotting: Customizing the Graph Data Plots: General Tips Making a Data Plot Active Within a graph layer, only one data plot can be active. A data plot must be set active before you can use the Data Selector
Data Mining and Visualization
Data Mining and Visualization Jeremy Walton NAG Ltd, Oxford Overview Data mining components Functionality Example application Quality control Visualization Use of 3D Example application Market research
Formulas, Functions and Charts
Formulas, Functions and Charts :: 167 8 Formulas, Functions and Charts 8.1 INTRODUCTION In this leson you can enter formula and functions and perform mathematical calcualtions. You will also be able to
Diagrams and Graphs of Statistical Data
Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in
Introduction Course in SPSS - Evening 1
ETH Zürich Seminar für Statistik Introduction Course in SPSS - Evening 1 Seminar für Statistik, ETH Zürich All data used during the course can be downloaded from the following ftp server: ftp://stat.ethz.ch/u/sfs/spsskurs/
Prof. Nicolai Meinshausen Regression FS 2014. R Exercises
Prof. Nicolai Meinshausen Regression FS 2014 R Exercises 1. The goal of this exercise is to get acquainted with different abilities of the R statistical software. It is recommended to use the distributed
MARS STUDENT IMAGING PROJECT
MARS STUDENT IMAGING PROJECT Data Analysis Practice Guide Mars Education Program Arizona State University Data Analysis Practice Guide This set of activities is designed to help you organize data you collect
Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining
Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.
Information Visualization Multivariate Data Visualization Krešimir Matković
Information Visualization Multivariate Data Visualization Krešimir Matković Vienna University of Technology, VRVis Research Center, Vienna Multivariable >3D Data Tables have so many variables that orthogonal
Package RIGHT. March 30, 2015
Type Package Title R Interactive Graphics via HTML Version 0.2.0 Date 2015-03-30 Package RIGHT March 30, 2015 Author ChungHa Sung, TaeJoon Song, JongHyun Bae, SangGi Hong, Jae W. Lee, and Junghoon Lee
Journal of Statistical Software
JSS Journal of Statistical Software October 2006, Volume 16, Code Snippet 3. http://www.jstatsoft.org/ LDheatmap: An R Function for Graphical Display of Pairwise Linkage Disequilibria between Single Nucleotide
All Visualizations Documentation
All Visualizations Documentation All Visualizations Documentation 2 Copyright and Trademarks Licensed Materials - Property of IBM. Copyright IBM Corp. 2013 IBM, the IBM logo, and Cognos are trademarks
Microsoft Excel 2010 Charts and Graphs
Microsoft Excel 2010 Charts and Graphs Email: [email protected] Web Page: http://training.health.ufl.edu Microsoft Excel 2010: Charts and Graphs 2.0 hours Topics include data groupings; creating
Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering
Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques
Figure 1. An embedded chart on a worksheet.
8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the
Linear Discriminant Analysis
Fiche TD avec le logiciel : course5 Linear Discriminant Analysis A.B. Dufour Contents 1 Fisher s iris dataset 2 2 The principle 5 2.1 Linking one variable and a factor.................. 5 2.2 Linking a
Analysis of System Performance IN2072 Chapter M Matlab Tutorial
Chair for Network Architectures and Services Prof. Carle Department of Computer Science TU München Analysis of System Performance IN2072 Chapter M Matlab Tutorial Dr. Alexander Klein Prof. Dr.-Ing. Georg
Exploratory Spatial Data Analysis
Exploratory Spatial Data Analysis Part II Dynamically Linked Views 1 Contents Introduction: why to use non-cartographic data displays Display linking by object highlighting Dynamic Query Object classification
Scientific Graphing in Excel 2010
Scientific Graphing in Excel 2010 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.
Tutorial for proteome data analysis using the Perseus software platform
Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information
Data Visualization Handbook
SAP Lumira Data Visualization Handbook www.saplumira.com 1 Table of Content 3 Introduction 20 Ranking 4 Know Your Purpose 23 Part-to-Whole 5 Know Your Data 25 Distribution 9 Crafting Your Message 29 Correlation
Exploratory Data Analysis and Plotting
Exploratory Data Analysis and Plotting The purpose of this handout is to introduce you to working with and manipulating data in R, as well as how you can begin to create figures from the ground up. 1 Importing
STC: Descriptive Statistics in Excel 2013. Running Descriptive and Correlational Analysis in Excel 2013
Running Descriptive and Correlational Analysis in Excel 2013 Tips for coding a survey Use short phrases for your data table headers to keep your worksheet neat, you can always edit the labels in tables
Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures
Introductory Statistics Lectures Visualizing Data Descriptive Statistics I Department of Mathematics Pima Community College Redistribution of this material is prohibited without written permission of the
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical
Clustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
Psychology 205: Research Methods in Psychology
Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready
Common Tools for Displaying and Communicating Data for Process Improvement
Common Tools for Displaying and Communicating Data for Process Improvement Packet includes: Tool Use Page # Box and Whisker Plot Check Sheet Control Chart Histogram Pareto Diagram Run Chart Scatter Plot
Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data
Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable
Lecture 2: Descriptive Statistics and Exploratory Data Analysis
Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals
Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501
CLUSTER ANALYSIS Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 January 2006 Introduction Cluster analysis includes a broad suite of techniques designed to find groups
Visualizing Data: Scalable Interactivity
Visualizing Data: Scalable Interactivity The best data visualizations illustrate hidden information and structure contained in a data set. As access to large data sets has grown, so has the need for interactive
Distances, Clustering, and Classification. Heatmaps
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
Interactive Voting System. www.ivsystem.nl. IVS-Basic IVS-Professional 4.4
Interactive Voting System www.ivsystem.nl IVS-Basic IVS-Professional 4.4 Manual IVS-Basic 4.4 IVS-Professional 4.4 1213 Interactive Voting System The Interactive Voting System (IVS ) is an interactive
Advanced Statistical Methods in Insurance
Advanced Statistical Methods in Insurance 7. Multivariate Data All Pairwise Scattergrams Iris Data Set: 3 Species 50 Cases of each with p=4 measurements per case 2 Hudec & Schlögl 1 3-d Scatterplots iris[,
Data exploration with Microsoft Excel: analysing more than one variable
Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical
Exercise 1.12 (Pg. 22-23)
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
Tutorial: ggplot2. Ramon Saccilotto Universitätsspital Basel Hebelstrasse 10 T 061 265 34 07 F 061 265 31 09 [email protected] www.ceb-institute.
Tutorial: ggplot Ramon Saccilotto Universitätsspital Basel Hebelstrasse T 7 F 9 [email protected] www.ceb-institute.org About the ggplot Package Introduction "ggplot is an R package for producing statistical,
03 The full syllabus. 03 The full syllabus continued. For more information visit www.cimaglobal.com PAPER C03 FUNDAMENTALS OF BUSINESS MATHEMATICS
0 The full syllabus 0 The full syllabus continued PAPER C0 FUNDAMENTALS OF BUSINESS MATHEMATICS Syllabus overview This paper primarily deals with the tools and techniques to understand the mathematics
Visualization methods for patent data
Visualization methods for patent data Treparel 2013 Dr. Anton Heijs (CTO & Founder) Delft, The Netherlands Introduction Treparel can provide advanced visualizations for patent data. This document describes
Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering
Portal Connector Fields and Widgets Technical Documentation
Portal Connector Fields and Widgets Technical Documentation 1 Form Fields 1.1 Content 1.1.1 CRM Form Configuration The CRM Form Configuration manages all the fields on the form and defines how the fields
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS
DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar
MultiExperiment Viewer Quickstart Guide
MultiExperiment Viewer Quickstart Guide Table of Contents: I. Preface - 2 II. Installing MeV - 2 III. Opening a Data Set - 2 IV. Filtering - 6 V. Clustering a. HCL - 8 b. K-means - 11 VI. Modules a. T-test
Data Visualization. Scientific Principles, Design Choices and Implementation in LabKey. Cory Nathe Software Engineer, LabKey cnathe@labkey.
Data Visualization Scientific Principles, Design Choices and Implementation in LabKey Catherine Richards, PhD, MPH Staff Scientist, HICOR [email protected] Cory Nathe Software Engineer, LabKey [email protected]
Data representation and analysis in Excel
Page 1 Data representation and analysis in Excel Let s Get Started! This course will teach you how to analyze data and make charts in Excel so that the data may be represented in a visual way that reflects
A Demonstration of Hierarchical Clustering
Recitation Supplement: Hierarchical Clustering and Principal Component Analysis in SAS November 18, 2002 The Methods In addition to K-means clustering, SAS provides several other types of unsupervised
Multivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
2. Simple Linear Regression
Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according
MicroStrategy Analytics Express User Guide
MicroStrategy Analytics Express User Guide Analyzing Data with MicroStrategy Analytics Express Version: 4.0 Document Number: 09770040 CONTENTS 1. Getting Started with MicroStrategy Analytics Express Introduction...
Analytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
sample median Sample quartiles sample deciles sample quantiles sample percentiles Exercise 1 five number summary # Create and view a sorted
Sample uartiles We have seen that the sample median of a data set {x 1, x, x,, x n }, sorted in increasing order, is a value that divides it in such a way, that exactly half (i.e., 50%) of the sample observations
Spreadsheet. Parts of a Spreadsheet. Entry Bar
Spreadsheet Parts of a Spreadsheet 1. Open the AppleWorks program. Select spreadsheet. 2. Explore the spreadsheet setup for a while. Active Cell Address Entry Bar Column Headings Row Headings Active Cell
business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar
business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel
Package tagcloud. R topics documented: July 3, 2015
Package tagcloud July 3, 2015 Type Package Title Tag Clouds Version 0.6 Date 2015-07-02 Author January Weiner Maintainer January Weiner Description Generating Tag and Word Clouds.
Instructions for SPSS 21
1 Instructions for SPSS 21 1 Introduction... 2 1.1 Opening the SPSS program... 2 1.2 General... 2 2 Data inputting and processing... 2 2.1 Manual input and data processing... 2 2.2 Saving data... 3 2.3
Data Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA
Paper 156-2010 The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA Abstract JMP has a rich set of visual displays that can help you see the information
Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA 30602-2501
PRINCIPAL COMPONENTS ANALYSIS (PCA) Steven M. Ho!and Department of Geology, University of Georgia, Athens, GA 30602-2501 May 2008 Introduction Suppose we had measured two variables, length and width, and
Exploratory data analysis (Chapter 2) Fall 2011
Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,
Using Excel for Handling, Graphing, and Analyzing Scientific Data:
Using Excel for Handling, Graphing, and Analyzing Scientific Data: A Resource for Science and Mathematics Students Scott A. Sinex Barbara A. Gage Department of Physical Sciences and Engineering Prince
Using Excel for descriptive statistics
FACT SHEET Using Excel for descriptive statistics Introduction Biologists no longer routinely plot graphs by hand or rely on calculators to carry out difficult and tedious statistical calculations. These
Periodontology. Digital Art Guidelines JOURNAL OF. Monochrome Combination Halftones (grayscale or color images with text and/or line art)
JOURNAL OF Periodontology Digital Art Guidelines In order to meet the Journal of Periodontology s quality standards for publication, it is important that authors submit digital art that conforms to the
A Layered Grammar of Graphics
Please see the online version of this article for supplementary materials. A Layered Grammar of Graphics Hadley WICKHAM A grammar of graphics is a tool that enables us to concisely describe the components
Statistics Revision Sheet Question 6 of Paper 2
Statistics Revision Sheet Question 6 of Paper The Statistics question is concerned mainly with the following terms. The Mean and the Median and are two ways of measuring the average. sumof values no. of
Bar Graphs and Dot Plots
CONDENSED L E S S O N 1.1 Bar Graphs and Dot Plots In this lesson you will interpret and create a variety of graphs find some summary values for a data set draw conclusions about a data set based on graphs
Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies
Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies Components of a Chart 1 Chart types 2 Data tables 4 The Chart Wizard 5 Column Charts 7 Line charts
GeoGebra Statistics and Probability
GeoGebra Statistics and Probability Project Maths Development Team 2013 www.projectmaths.ie Page 1 of 24 Index Activity Topic Page 1 Introduction GeoGebra Statistics 3 2 To calculate the Sum, Mean, Count,
Final Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering
Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar
