Data Visualization in R
L. Torgo (ltorgo@fc.up.pt)
Faculdade de Ciências / LIAAD-INESC TEC, LA, Universidade do Porto
Oct, 2014
Motivation for Data Visualization
- Humans are outstanding at detecting patterns and structures with their eyes
- Data visualization methods try to exploit these capabilities
- In spite of all their advantages, visualization methods also have several problems, particularly with very large data sets
L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 2 / 58
Outline of what we will learn
- Tools for univariate data
- Tools for bivariate data
- Tools for multivariate data
- Multidimensional scaling methods
R Graphics
Standard Graphics (the graphics package)
- R standard graphics, available through package graphics, include several functions that provide standard statistical plots, such as:
  - Scatterplots
  - Boxplots
  - Pie charts
  - Barplots
  - etc.
- These graphs can typically be obtained with a single function call
- Example of a scatterplot:
    plot(1:10, sin(1:10))
- These graphs can easily be augmented by adding further elements (lines, text, etc.)
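As an illustration of augmenting a standard plot, the sketch below first creates a scatterplot and then adds elements to it with low-level functions from package graphics. The title, the label text and the use of a null PDF device are assumptions made only so the example is self-contained.

```r
# Open a PDF device that writes no file, so the sketch also runs without a screen
# (assumption: in an interactive session this line can simply be skipped).
pdf(NULL)

x <- 1:10
y <- sin(x)

plot(x, y, main = "sin(x) at the integers 1 to 10")  # high-level call: creates the plot
lines(x, y, lty = 2)                                  # low-level call: joins the points
abline(h = 0, col = "grey")                           # horizontal reference line at y = 0
text(x[which.max(y)], max(y), "maximum", pos = 1)     # label just below the highest point

dev.off()
```

High-level functions such as plot() start a new graph, while lines(), abline() and text() draw on the graph that already exists.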
Graphics Devices
- R graphics functions produce output that depends on the active graphics device
- The default and most frequently used device is the screen
- There are many more graphics devices in R, such as the pdf device, the jpeg device, etc.
- The user just needs to open (and in the end close) the graphics output device she/he wants. R takes care of producing the type of output required by the device
- This means that the R code to produce a certain plot on the screen or as a JPEG graphics file is exactly the same. You only need to open the target output device beforehand!
- Several devices may be open at the same time, but only one is the active device
A few examples
- A scatterplot
    plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))
- The same but stored in a jpeg file
    jpeg('exp.jpg')
    plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))
    dev.off()
- And now as a pdf file
    pdf('exp.pdf',width=6,height=6)
    plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))
    dev.off()
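The idea that several devices can be open while only one is active can be sketched as follows. The output path in the temporary directory is a hypothetical choice, and the first device is a PDF device that writes no file, used here only as a stand-in for the screen.

```r
out <- file.path(tempdir(), "exp.pdf")   # hypothetical output file

pdf(NULL)                                # first device: writes no file
first <- dev.cur()                       # remember its device number
pdf(out, width = 6, height = 6)          # second device: becomes the active one
second <- dev.cur()

n_open <- length(dev.list())             # number of devices currently open

dev.set(first)                           # make the first device active
plot(1:10)                               # this plot goes to the first device

dev.set(second)                          # switch to the PDF file device
plot(seq(0, 4*pi, 0.1), sin(seq(0, 4*pi, 0.1)))   # this plot goes into the file

dev.off(second)                          # closing the device finishes the file
dev.off(first)
```

dev.cur() reports the active device, dev.list() lists all open ones, and dev.set() changes which one receives the next plot.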
Package ggplot2
- Package ggplot2 implements the ideas proposed by Wilkinson (2005) on a grammar of graphics
- This grammar is the result of a theoretical study on what a statistical graphic is
- ggplot2 builds upon this theory by implementing the concept of a layered grammar of graphics (Wickham, 2009)
- The grammar defines a statistical graphic as a mapping from data into aesthetic attributes (color, shape, size, etc.) of geometric objects (points, lines, bars, etc.)
References
- L. Wilkinson (2005). The Grammar of Graphics. Springer.
- H. Wickham (2009). A layered grammar of graphics. Journal of Computational and Graphical Statistics.
The Basics of the Grammar of Graphics
Key elements of a statistical graphic:
- data
- aesthetic mappings
- geometric objects
- statistical transformations
- scales
- coordinate system
- faceting
Aesthetic Mappings
Control the relation between data variables and graphic variables, for example:
- map the Temperature variable of a data set into the x variable in a scatter plot
- map the Species of a plant into the colour of dots in a graphic
- etc.
Geometric Objects
Control what is shown in the graphic, for example:
- show each observation as a point, using the aesthetic mappings that map two variables in the data set into the x, y variables of the plot
- etc.
Statistical Transformations
Allow us to compute statistical summaries of the data shown in the plot, for example:
- approximate the data by a regression line on the x, y coordinates
- count occurrences of certain values
- etc.
Scales
Map the data values into values in the coordinate system of the graphics device
Coordinate System
The coordinate system used to plot the data:
- Cartesian
- Polar
- etc.
Faceting
Split the data into sub-groups and draw a sub-graph for each group
A Simple Example
    data(iris)
    library(ggplot2)
    # column names are case-sensitive: Petal.Length, Sepal.Length, Species
    ggplot(iris, aes(x=Petal.Length, y=Sepal.Length, colour=Species)) +
      geom_point()
[Figure: scatterplot of Sepal.Length against Petal.Length, points coloured by Species (setosa, versicolor, virginica)]
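To connect the seven elements of the grammar to actual code, the sketch below extends a scatterplot of the iris data with one layer or component per element. The particular choices (a linear smoother, the "Dark2" palette, faceting by Species) are illustrative assumptions, not part of the original example.

```r
library(ggplot2)

p <- ggplot(iris,                                    # data
            aes(x = Petal.Length, y = Sepal.Length,  # aesthetic mappings
                colour = Species)) +
  geom_point() +                                     # geometric object
  geom_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                          # statistical transformation
  scale_colour_brewer(palette = "Dark2") +           # scale
  coord_cartesian() +                                # coordinate system
  facet_wrap(~ Species)                              # faceting

print(p)   # a ggplot object is only drawn when printed
```

Each component is added with `+`, so removing or swapping one line changes one aspect of the graphic without touching the others.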
The Distribution of Values of Nominal Variables: The Barplot
- The Barplot is a graph whose main purpose is to display a set of values as heights of bars
- It can be used to display the frequency of occurrence of the different values of a nominal variable as follows:
  - First obtain the number of occurrences of each value
  - Then use the Barplot to display these counts
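The two steps above can be sketched with a data set that ships with R. The choice of the chickwts data and its nominal variable feed is an assumption made only to keep the example self-contained.

```r
# Step 1: count how often each value of the nominal variable occurs
counts <- table(chickwts$feed)
print(counts)

# Step 2: display the counts as heights of bars
barplot(counts,
        main = "Number of chicks per feed type",
        ylab = "count", las = 2)
```

table() returns one count per level of the variable, and barplot() draws one bar per count.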
Barplots in base graphics
    data(algae, package="DMwR")   # the algae data set from package DMwR
    barplot(table(algae$season),
            main='Distribution of the Water Samples across Seasons')
[Figure: barplot of the number of water samples per season (autumn, spring, summer, winter)]
Barplots in ggplot2
    ggplot(algae,aes(x=season)) + geom_bar() +
      ggtitle('Distribution of the Water Samples across Seasons')
[Figure: barplot of count per season (autumn, spring, summer, winter)]
Pie Charts
Pie charts serve the same purpose as bar plots but present the information in the form of a pie.
    pie(table(algae$season),
        main='Distribution of the Water Samples across Seasons')
[Figure: pie chart of the water samples per season (spring, autumn, summer, winter)]
Pie Charts in ggplot
    ggplot(algae,aes(x=factor(1),fill=season)) + geom_bar(width=1) +
      ggtitle('Distribution of the Water Samples across Seasons') +
      coord_polar(theta="y")
[Figure: pie chart of count, filled by season (autumn, spring, summer, winter)]
The Distribution of Values of a Continuous Variable: The Histogram
- The Histogram is a graph whose main purpose is to display how the values of a continuous variable are distributed
- It is obtained as follows:
  - First the range of the variable is divided into a set of bins (intervals of values)
  - Then the number of occurrences of values in each bin is counted
  - Then this number is displayed as a bar
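The three steps of building a histogram can be reproduced by hand and compared with what hist() computes. This is only a sketch on simulated data; the sample and the bin limits are assumptions chosen for illustration.

```r
set.seed(1)
x <- rnorm(200)                      # simulated continuous variable (assumption)

# Step 1: divide the range into bins (intervals of values)
breaks <- seq(-4, 4, by = 1)

# Step 2: count the occurrences in each bin; cut() uses intervals (a, b]
bins   <- cut(x, breaks = breaks)
counts <- as.vector(table(bins))

# Step 3: display each count as a bar
barplot(counts, names.arg = levels(bins), space = 0, las = 2,
        main = "Histogram built by hand", ylab = "count")

# hist() performs the same computation internally
h <- hist(x, breaks = breaks, plot = FALSE)
```

With the same bin limits, the counts obtained by hand and those stored in `h$counts` coincide.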
Two examples of Histograms in R
    hist(algae$a1,main='Distribution of Algae a1',
         xlab='a1',ylab='Nr. of Occurrences')
    hist(algae$a1,main='Distribution of Algae a1',
         xlab='a1',ylab='Percentage',prob=TRUE)
[Figure: two histograms of a1, the first with counts (Nr. of Occurrences) and the second with densities (Percentage) on the y axis]
Two examples of Histograms in ggplot2
    ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10) +
      ggtitle("Distribution of Algae a1") + ylab("Nr. of Occurrences")
    ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10,aes(y=..density..)) +
      ggtitle("Distribution of Algae a1") + ylab("Percentage")
[Figure: two ggplot2 histograms of a1, the first with counts (Nr. of Occurrences) and the second with densities (Percentage) on the y axis]
Problems with Histograms
- Histograms may be misleading in small data sets
- Another key issue is how the limits of the bins are chosen
  - There are several algorithms for this
  - Check the Details section of the help page of function hist() if you want to know more about this and to obtain references on alternatives
  - Within ggplot2 you may control this through the binwidth parameter
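To see how much the choice of bins matters, the sketch below draws the same variable with three of the rules accepted by the breaks argument of hist(). The use of the built-in faithful data set is an assumption made for illustration.

```r
x <- faithful$eruptions   # eruption durations in minutes, shipped with R

# Number of bins suggested by each rule
n_bins <- c(Sturges = nclass.Sturges(x),
            Scott   = nclass.scott(x),
            FD      = nclass.FD(x))
print(n_bins)

# The same data drawn with each rule, side by side
op <- par(mfrow = c(1, 3))
for (rule in names(n_bins)) {
  hist(x, breaks = rule, main = rule, xlab = "eruption time (min)")
}
par(op)   # restore the previous layout
```

Note that hist() treats the suggested number of bins as a hint and rounds the limits to "pretty" values, so the drawn number of bars may differ slightly.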
Kernel Density Estimates
- Some of the problems of histograms can be tackled by smoothing the estimates of the distribution of the values. That is the purpose of kernel density estimates
- Kernel estimates calculate the estimate of the distribution at a certain point by smoothly averaging over the neighboring points
- Namely, the density is estimated by
    \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right)
  where K() is a kernel function and h a bandwidth parameter
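A minimal sketch of a kernel density estimate computed by hand, assuming a Gaussian kernel K and the usual normalisation by n times h, can be compared with the result of R's density() function. The simulated sample, the helper name kde and the bandwidth rule bw.nrd0() are choices made here for illustration.

```r
set.seed(1)
obs <- rnorm(200)        # simulated sample (assumption)
h   <- bw.nrd0(obs)      # default bandwidth rule also used by density()

# Hypothetical helper: Gaussian kernel estimate at each point of x,
# i.e. the sum of K((x - x_i) / h) over all observations, divided by n * h
kde <- function(x, obs, h) {
  sapply(x, function(x0) sum(dnorm((x0 - obs) / h)) / (length(obs) * h))
}

d      <- density(obs, bw = h, kernel = "gaussian")  # R's own estimate
manual <- kde(d$x, obs, h)                           # ours, on the same grid

hist(obs, prob = TRUE, main = "Histogram and kernel density estimate")
lines(d$x, manual, col = "red")   # estimate computed by hand
lines(d, lty = 2)                 # estimate from density()
```

The two curves should be almost indistinguishable; density() uses a fast approximation, so tiny numerical differences are expected.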
An Example and how to obtain it in R base graphics
    hist(algae$mxPH,main='The Histogram of mxPH (maximum pH)',
         xlab='mxPH Values',prob=TRUE)
    lines(density(algae$mxPH,na.rm=TRUE))   # na.rm: mxPH has missing values
    rug(algae$mxPH)
[Figure: histogram of mxPH on the density scale with the kernel density estimate overlaid and a rug of the observed values]
An Example and how to obtain it in ggplot2
    ggplot(algae,aes(x=mxPH)) +
      geom_histogram(binwidth=.5, aes(y=..density..)) +
      geom_density(color="red") + geom_rug() +
      ggtitle("The Histogram of mxPH (maximum pH)")
[Figure: ggplot2 histogram of mxPH on the density scale with the kernel density estimate in red and a rug of the observed values]
Graphing Quantiles
- The x quantile of a continuous variable is the value below which a proportion x of the data lies
- Examples of this concept are the 1st (25%) and 3rd (75%) quartiles and the median (50%)
- We can calculate these quantiles at different values of x and then plot them to provide an idea of how the values in a sample of data are distributed
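As a sketch of this idea, quantiles can be computed on a grid of proportions and plotted. The built-in faithful data set and the grid of proportions are assumptions made for illustration.

```r
x <- faithful$waiting            # waiting times in minutes, shipped with R

# The quartiles and the median
print(quantile(x, probs = c(0.25, 0.50, 0.75)))

# Quantiles on a finer grid of proportions
p <- seq(0.05, 0.95, by = 0.05)
q <- quantile(x, probs = p)

plot(p, q, type = "b",
     xlab = "proportion of the data below the value",
     ylab = "waiting time (min)",
     main = "Quantiles of the waiting times")
```

Flat stretches of the curve correspond to regions where many observations are concentrated, and steep stretches to regions with few observations.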
The Cumulative Distribution Function in Base Graphics
    plot(ecdf(algae$NO3),
         main='Cumulative Distribution Function of NO3',xlab='NO3 values')
[Figure: empirical cumulative distribution function Fn(x) of NO3]
The Cumulative Distribution Function in ggplot
    ggplot(algae,aes(x=NO3)) + stat_ecdf() + xlab('NO3 values') +
      ggtitle('Cumulative Distribution Function of NO3')
[Figure: empirical cumulative distribution function of NO3 drawn with ggplot2]
QQ Plots
- Graphs that can be used to compare the observed distribution against the Normal distribution
- Can be used to visually check the hypothesis that the variable under study follows a normal distribution
- Obviously, more formal tests also exist
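The sketch below contrasts the QQ plot of a sample drawn from a normal distribution with that of a clearly skewed sample, and then applies one formal test, the Shapiro-Wilk test, to the skewed sample. The simulated samples and the choice of test are assumptions made for illustration, using only base R functions.

```r
set.seed(42)
normal <- rnorm(200)   # sample from a normal distribution
skewed <- rexp(200)    # sample from a skewed (exponential) distribution

op <- par(mfrow = c(1, 2))
qqnorm(normal, main = "Normal sample")
qqline(normal, col = "red")   # line through the 1st and 3rd quartiles
qqnorm(skewed, main = "Skewed sample")
qqline(skewed, col = "red")
par(op)

# A formal alternative: the Shapiro-Wilk normality test
p_skewed <- shapiro.test(skewed)$p.value
print(p_skewed)
```

Points close to the line are compatible with normality; the skewed sample bends away from it, and its very small p-value leads to the same conclusion.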
QQ Plots from package car using base graphics
    library(car)
    qqPlot(algae$mxPH,main='QQ Plot of Maximum PH Values')
[Figure: QQ plot of algae$mxPH against normal quantiles]
QQ Plots using ggplot2
    ggplot(algae,aes(sample=mxPH)) + stat_qq() +
      ggtitle('QQ Plot of Maximum PH Values')
[Figure: QQ plot of the sample quantiles of mxPH against the theoretical normal quantiles]
QQ Plots using ggplot2 (2)
    # line through the 1st and 3rd quartiles of the sample and of the normal
    q.x <- quantile(algae$mxPH,c(0.25,0.75),na.rm=TRUE)
    q.z <- qnorm(c(0.25,0.75))
    b <- (q.x[2] - q.x[1])/(q.z[2] - q.z[1])   # slope
    a <- q.x[1] - b * q.z[1]                   # intercept
    ggplot(algae,aes(sample=mxPH)) + stat_qq() +
      ggtitle('QQ Plot of Maximum PH Values') +
      geom_abline(intercept=a,slope=b,color="red")
[Figure: QQ plot of mxPH against the theoretical normal quantiles, with the reference line in red]
Box Plots
- Box plots provide useful summaries of the distribution of a variable
- For instance, they inform us of the interquartile range and of the outliers (if any)
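The numbers behind a box plot can be inspected with boxplot.stats(), which is what boxplot() uses internally. The small vector below is made-up data, chosen so that one value is an obvious outlier.

```r
x <- c(1:10, 50)          # made-up values; 50 is far from the rest

s <- boxplot.stats(x)
print(s$stats)            # lower whisker, lower hinge, median, upper hinge, upper whisker
box_height <- s$stats[4] - s$stats[2]   # height of the box (spread between the hinges)
print(box_height)
print(s$out)              # values drawn as individual points (outliers)

boxplot(x, ylab = "x", main = "Box plot with one outlier")
```

By default a value is flagged as an outlier when it lies more than 1.5 times the height of the box beyond the box limits.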
An Example with base graphics
    boxplot(algae$mnO2,ylab='Minimum values of O2')
[Figure: box plot of mnO2, y axis labelled "Minimum values of O2"]
An Example with ggplot2
    ggplot(algae,aes(x=factor(0),y=mnO2)) + geom_boxplot() +
      ylab("Minimum values of O2") + xlab("") +
      scale_x_discrete(breaks=NULL)
[Figure: ggplot2 box plot of mnO2, y axis labelled "Minimum values of O2"]
Hands on Data Visualization - the Algae data set
Using the Algae data set from package DMwR, answer the following questions:
1. Create a graph that you find adequate to show the distribution of the values of algae a6
2. Show the distribution of the values of size
3. Check visually if it is plausible to consider that oPO4 follows a normal distribution
Solution of Exercise 1
Create a graph that you find adequate to show the distribution of the values of algae a6
    hist(algae$a6,main="Distribution of Algae a6",xlab="Frequency of a6",
         ylab="Nr. of Occurrences")
[Figure: histogram of a6, x axis "Frequency of a6", y axis "Nr. of Occurrences"]
Solution of Exercise 1 with ggplot2
Create a graph that you find adequate to show the distribution of the values of algae a6
    ggplot(algae,aes(x=a6)) + geom_histogram(binwidth=10) +
      ggtitle("Distribution of Algae a6")
[Figure: ggplot2 histogram of a6 (count against a6)]
Solution to Exercise 2
Show the distribution of the values of size
    barplot(table(algae$size), main = "Distribution of River Size")
[Figure: barplot of the number of samples per river size (large, medium, small)]
Solution to Exercise 2 with ggplot2
Show the distribution of the values of size
    ggplot(algae,aes(x=size)) + geom_bar() +
      ggtitle("Distribution of River Size")
[Figure: ggplot2 barplot of count per river size (large, medium, small)]
Solution to Exercise 3
Check visually if it is plausible to consider that oPO4 follows a normal distribution
    library(car)
    qqPlot(algae$oPO4,main = "QQ Plot of oPO4")
[Figure: QQ plot of algae$oPO4 against normal quantiles]
Conditioned Graphs
- Data sets frequently have nominal variables that can be used to create sub-groups of the data according to the values of these variables
  - e.g. the sub-group of male clients of a company
- Some of the visual summaries described before can be obtained on each of these sub-groups
- Conditioned plots present these sub-group graphs simultaneously, which makes it easier to find possible differences between the sub-groups
- Base graphics have only limited support for conditioning (e.g. the formula interface of boxplot()), whereas ggplot2 supports it directly through the concept of facets
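One way to emulate conditioning with base graphics alone is to split the data by the nominal variable and draw one panel per sub-group on a shared scale. This is only a sketch; the built-in iris data and the variable names are assumptions made for illustration.

```r
# Split a continuous variable by the values of a nominal variable
groups <- split(iris$Sepal.Length, iris$Species)

# One panel per sub-group, all with the same x range to ease comparison
op <- par(mfrow = c(1, length(groups)))
for (g in names(groups)) {
  hist(groups[[g]], main = g, xlab = "Sepal.Length",
       xlim = range(iris$Sepal.Length))
}
par(op)   # restore the previous layout
```

Faceting in ggplot2 performs the split, the panel layout and the shared scales automatically.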
Conditioned Histograms
Goal: Contrast the distribution of data sub-groups
    # order the levels so that the panels appear from low to high speed
    algae$speed <- factor(algae$speed,levels=c("low","medium","high"))
    ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) +
      facet_wrap(~ speed) +
      ggtitle("Distribution of A3 by River Speed")
[Figure: histograms of a3 (count), one panel per river speed (low, medium, high)]
Conditioned Histograms (2)
    algae$season <- factor(algae$season,
                           levels=c("spring","summer","autumn","winter"))
    ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) +
      facet_grid(speed~season) +
      ggtitle('Distribution of a3 as a function of River Speed and Season')
[Figure: grid of histograms of a3 (count), rows by river speed (low, medium, high) and columns by season (spring, summer, autumn, winter)]
Conditioned Box Plots on base graphics
    boxplot(a6 ~ season,algae,
            main='Distribution of a6 as a function of Season',
            xlab='Season',ylab='Distribution of a6')
[Figure: box plots of a6 for each season (spring, summer, autumn, winter)]
Conditioned Box Plots on ggplot2
    ggplot(algae,aes(x=season,y=a6)) + geom_boxplot() +
      ggtitle('Distribution of a6 as a function of Season and River Size')
[Figure: ggplot2 box plots of a6 for each season (spring, summer, autumn, winter)]
Hands on Data Visualization - Algae data set
Using the Algae data set from package DMwR, answer the following questions:
1. Produce a graph that allows you to understand how the values of NO3 are distributed across the sizes of river
2. Try to understand (using a graph) if the distribution of algae a1 varies with the speed of the river
Solution to Exercise 1
Produce a graph that allows you to understand how the values of NO3 are distributed across the sizes of river.
    ggplot(algae,aes(x=NO3)) + geom_histogram(binwidth=5) +
      facet_wrap(~size) +
      ggtitle("Distribution of NO3 for different river sizes")
[Figure: histograms of NO3 (count), one panel per river size (large, medium, small)]
Solution to Exercise 2
Try to understand (using a graph) if the distribution of algae a1 varies with the speed of the river
    ggplot(algae,aes(x=speed,y=a1)) + geom_boxplot() +
      ggtitle("Distribution of a1 for different River Speeds")
[Figure: box plots of a1 for each river speed (high, low, medium)]
Scatterplots in base graphics
The Scatterplot is the natural graph for showing the relationship between two numerical variables
    plot(algae$a1,algae$a6,
         main='Relationship between the frequency of a1 and a6',
         xlab='a1',ylab='a6')
[Figure: scatterplot of a6 against a1]
Scatterplots in ggplot2
    ggplot(algae,aes(x=a1,y=a6)) + geom_point() +
      ggtitle('Relationship between the frequency of a1 and a6')
[Figure: ggplot2 scatterplot of a6 against a1]
Conditioned Scatterplots using the ggplot2 package
    ggplot(algae,aes(x=a4,y=Chla)) + geom_point() + facet_wrap(~season)
[Figure: scatterplots of Chla against a4, one panel per season (autumn, spring, summer, winter)]
Scatterplot matrices through package GGally
These graphs try to present all pairwise scatterplots in a data set. They are infeasible for data sets with many variables.
    library(GGally)
    ggpairs(algae,columns=12:16,
            diag=list(continuous="density", discrete="bar"),axisLabels="show")
[Figure: scatterplot matrix of a1 to a5, with density plots on the diagonal and pairwise correlations in the upper triangle]
Scatterplot matrices involving nominal variables
    ggpairs(algae,columns=c("season","speed","a1","a2"),axisLabels="show")
[Figure: plot matrix of season, speed, a1 and a2; the correlation between a1 and a2 is 0.294]
Parallel Plots
Parallel plots are also interesting for visualizing a data set
    ggparcoord(algae,columns=12:16,groupColumn="speed")
[Figure: parallel coordinates plot of a1 to a5 (value against variable), lines coloured by speed (high, low, medium)]