Introduction to R and Exploratory data analysis

Gavin Simpson
November 2006

Summary

In this practical class we will introduce you to working with R. You will complete an introductory session with R and then use a data set of Spheroidal Carbonaceous Particle (SCP) surface chemistry to demonstrate some of the Exploratory Data Analysis (EDA) functions in R. Finally, we introduce the concept of statistical tests in R through a small study of fecundity in a predatory gastropod on intertidal shores.

1 Your first R session

R can be used as a glorified calculator; you enter calculations at the prompt, R works out the answer and displays it to you. Try the following sequence of commands. Remember, only type the commands on lines starting with the prompt > below. In each case, the output from R is shown.

> 3 * 5
[1] 15
>
[1] 9
> 100/5
[1] 20
> 99 - 1
[1] 98

After each sum R prints out the result with a [1] at the start of the line. This means the result is a vector of length 1. As you will see later, this prefix shows the index of the vector entry displayed at the beginning of each line.

This isn't much use if you can't store results though. To store the result of a calculation, the assignment operator <- is used.

> a <- 3 * 5

Notice that this time R does not print the answer to the console. You have just created an object in R with the name a. To print out the object to the console you just have to type the name of the object and press Enter.

> a
[1] 15

We can now use this to do sums like you would on a calculator using the memory functions.

> b <- 9
> a + b
[1] 24
> a * b
[1] 135
> c <- a * b
> c
[1] 135

To see a list of the objects in the current workspace use the ls() function:

> ls()
[1] "a" "b" "c"
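To make the meaning of the [1] prefix concrete, the short sketch below prints a vector that is long enough to wrap across several console lines. The object name x is ours and is not part of the practical; where the lines break depends on the width of your console.

```r
# A result of length 1: the [1] marks the first (and only) element.
a <- 3 * 5
length(a)

# A longer vector wraps over several console lines; each printed line
# starts with the index of the first element shown on that line.
x <- 101:130
x
length(x)
```

Resize the console window and print x again to see the bracketed indices change.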

2 Exploratory Data Analysis

Use R's EDA functions to examine the SCP data with a view to answering the following questions:

1. Suggest which chemical elements give the best discrimination between coal and oil particles;
2. Suggest which variables are highly correlated and may be dropped from the analysis;
3. Suggest which particles (either coal or oil) have an unusual chemical composition, i.e., are outliers.

SCPs are produced by the high temperature combustion of fossil fuels in oil and coal-fired power stations. Since SCPs and sulphur (as SO2) are emitted from the same sources and are dispersed in the same way, the record of SCPs in lake sediments may be used to infer the spatial and temporal patterns of sulphur deposition, an acid loading on lake systems. In addition, SCPs produced by the combustion of coal and oil have different chemical signatures, so characterization of particles in a sediment core can be used to partition the pollution loading into distinct sources.

The data set consists of a sample of 100 SCPs from two power stations (50 SCPs from each). One, Pembroke, is an oil-fired station; the other, Drax, is coal-fired. The data form a training set that was used to generate a discriminant function for classifying SCPs derived from lake sediments into fuel type. The samples of fly-ash from power station flues were analysed using Energy Dispersive Spectroscopy (EDS) and the chemical composition was measured. These data are available in the file scp.txt, a standard comma-delimited ASCII text file. You will use R in this practical class, learn how to read the data into R and how to use the software to perform standard EDA and graphics.

3 Reading the data into R

Firstly, start R on the system you are using in the prac1 directory that contains the files for this practical; instructions on how to do this have been circulated on a separate handout. Your screen should look similar (but not identical) to the one shown in Figure 1.

The R prompt is a > character and commands are entered via the keyboard and evaluated after the user presses the return key. The following commands read the data from the scp.txt file into an R object called scp.dat and perform some rudimentary manipulation of the data:

> scp.dat <- read.csv(file = "scp.txt", row.names = 1)
> scp.dat[, 18] <- as.factor(scp.dat[, 18])
> levels(scp.dat[, 18]) <- c("Coal", "Oil")

The read.csv() function reads the data into an object, scp.dat. The argument row.names instructs R to use the first column of data as the row names.
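If you do not have scp.txt to hand, the same three steps can be tried on a tiny stand-in. The sketch below is an illustration only: the three rows, the sample names and the reduced set of columns (Na, S, FuelType) are invented, and the data are read from a character string through the text argument of read.csv() rather than from a file.

```r
# Hypothetical three-row stand-in for scp.txt (all values invented).
csv_text <- "SampleID,Na,S,FuelType
P1,1.2,30.5,1
P2,0.8,28.1,1
D1,2.4,5.3,0"

# The first column (SampleID) becomes the row names.
demo <- read.csv(text = csv_text, row.names = 1)

# FuelType is now column 3; convert it from numeric to a factor.
demo[, 3] <- as.factor(demo[, 3])
levels(demo[, 3])                      # levels are sorted: "0" then "1"

# Relabel the levels in that order: 0 becomes Coal, 1 becomes Oil.
levels(demo[, 3]) <- c("Coal", "Oil")
str(demo)
```

Because the replacement labels are matched to the sorted levels, the order of the labels passed to levels() matters; swapping them would silently swap the fuel types.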

Figure 1: The R interface; in this case the R GUI under Microsoft Windows.

By default R numbers the observations. As the scp.txt file contains a column labelled "SampleID", which contains the sample names, the row.names = 1 argument in the call to read.csv() tells R to use that first column as the sample (row) names instead.

Note that we subset the scp.dat object using the following notation:

object[r,c]

Here r and c represent the row and column required, respectively. To select the first column of scp.dat we use scp.dat[,1], leaving the row identifier blank.

The last variable in scp.dat is FuelType. This variable is currently coded as "1", "0". R has a special way of dealing with factors, but we need to tell R that FuelType is a factor (FuelType is currently a numeric variable). The function as.factor() is used to coerce a vector of data from one type into a factor (scp.dat[,18] <- as.factor(scp.dat[,18])). The levels of FuelType are still coded as 1, 0. We can replace the current levels with something more useful using levels(scp.dat[,18]) <- c("Coal", "Oil"). Note the use of the concatenate function, c(), to create a character vector of length 2.

Simply typing the name of an object followed by return prints the contents of that object to the screen (increase the size of your R console window before doing this, otherwise the printed contents will extend over many pages).

> scp.dat
> names(scp.dat)
> str(scp.dat)

The names() function prints out the names describing the contents of the object. The output from names() depends on the class of the object passed as an argument (see footnote 1) to names(). For a data frame like scp.dat, names() prints out the labels for the columns, the variables. The str() function prints out the structure of the object. The output from str() shows that scp.dat contains 17 numeric variables and one factor variable (FuelType) with the levels "Coal" and "Oil".

4 Summary statistics

Simple summary statistics can be generated using summary():

> summary(scp.dat)

summary() is a generic function, used to summarize an object passed as an argument to it. For data frames, numeric matrices and vectors, summary() prints out the minimum and maximum values, the mean, the median and the upper and lower quartiles for each variable in the object. For factor variables, the number of observations of each level is displayed.

Another way of generating specific summary information is to use the corresponding function, e.g. mean(), min(), range() or median():

> mean(scp.dat[, 1])
[Output: the mean of the first variable.]

Doing this for all 17 variables and for each of the descriptive statistical functions could quickly become tedious. R contains many functions that allow us to quickly apply functions across each variable in a data frame. For example, to get the mean for each variable in the scp.dat data frame you would type:

> apply(scp.dat[, -18], 2, mean)
[Output: a named vector of means, one for each of the 17 chemical variables (Na, Mg, Al, Si, P, S, Cl, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn).]

As its name suggests, apply() takes as its argument a vector or array and applies a named function to either the rows or the columns of that vector or array (the 2 in the above call indicates that we want to work on the columns). apply() can be used to run any appropriate function, including a user's own custom functions. As an example, we now calculate the trimmed mean:

Footnote 1: Arguments are the values entered between the brackets of a call to a function, and indicate a variety of options, such as the name of an object to work on or which method to use. A function's arguments are documented in the help pages for that function.

> apply(scp.dat[, -18], 2, mean, trim = 0.1)
[Output: a named vector of trimmed means, one for each of the 17 chemical variables.]

In the above code snippet, we pass an additional argument to mean(), trim = 0.1, which calculates the trimmed mean by trimming 10% of the data off each end of the distribution and then calculating the mean of the remaining data points. How this works is quite simple. apply() is told to work on columns 1 through 17 of the scp.dat object. Each column in turn is passed to our function and mean() calculates the trimmed mean.

Try using apply() to calculate the range (range()), the variance (var()) and the standard deviation (sd()) for variables 1 through 17 in scp.dat.

Having now looked at the summary statistics for the entire data set, we can start to look for differences in SCP surface chemistry between the two fuel types. This time we need to subset the data and calculate the summary statistics for each level in FuelType. There are many ways of subsetting a data frame, but the easiest way is to use the aggregate() function. First, attach scp.dat to the search path (see footnote 2) using:

> attach(scp.dat)

Now try the following code snippet:

> aggregate(scp.dat[, -18], list(FuelType), mean)
[Output: a table with one row each for Coal and Oil (column Group.1) and the mean of each chemical variable in the remaining columns.]

> aggregate(scp.dat[, -18], list(FuelType), mean, trim = 0.1)

Footnote 2: Attaching an object to the search path tells R to look inside that object for variables as well as in the standard environment. By attaching scp.dat to the search path we can use the variable names directly.

[Output: a table with one row each for Coal and Oil (column Group.1) and the trimmed mean of each chemical variable in the remaining columns.]

As before, using the summary functions you have already tried, look for differences in the surface chemistry of SCPs derived from the two different fuel types.

Two more standard descriptive statistical functions commonly used are provided in the package e1071: skewness() and kurtosis(). To use them we have to load the e1071 package using the library() or require() functions. library() loads a named package, whilst require() checks to see whether the named package is loaded and loads it if it isn't. Are there differences in the skewness and kurtosis of the chemical variables for the two fuel types?

> require(e1071)
[1] TRUE
> aggregate(scp.dat[, -18], list(FuelType), kurtosis)
[Output: a table with one row each for Coal and Oil and the kurtosis of each chemical variable.]

> aggregate(scp.dat[, -18], list(FuelType), skewness)
[Output: a table with one row each for Coal and Oil and the skewness of each chemical variable.]

5 Univariate graphical exploratory data analysis

This section of the practical will demonstrate the graphical abilities of R for exploratory data analysis for univariate data.

5.1 Stem and leaf plots

A stem and leaf plot is a useful way of describing data. Like a histogram, a stem and leaf plot visualizes the distribution. In R, a stem and leaf plot is produced using stem().

> stem(Na)

  The decimal point is 1 digit(s) to the right of the |

[Output: the stem and leaf display for Na.]

Using stem(), look at the distribution of some of the other variables in scp.dat.

5.2 Histograms

Like stem and leaf plots, histograms can be used to get a rough impression of the distribution. Histograms are drawn by breaking the distribution into a number of sections or bins, which are generally of equal width. You will probably have seen frequency histograms, where the number of observations per bin is calculated and graphed. An alternative approach is to plot the relative frequency of observations per unit bin width (probability density estimates; more on density estimates later). Graphs showing histograms of the same data using the two methods look identical, except for the scaling on the y axis.
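The statement that the two kinds of histogram differ only in the scaling of the y axis can be checked numerically. The following is a minimal sketch using simulated data rather than the SCP data; the object names x, h and width are ours. It relies on hist() returning its counts and densities when called with plot = FALSE.

```r
set.seed(1)
x <- rnorm(100, mean = 10, sd = 2)   # simulated data, not the SCP data

# Compute the histogram without drawing it.
h <- hist(x, plot = FALSE)
width <- diff(h$breaks)              # bin widths

# Density = count / (total number of observations * bin width).
all.equal(h$density, h$counts / (sum(h$counts) * width))

# The area under a density histogram sums to 1.
sum(h$density * width)
```

Both versions use the same bins, so the shapes are identical when the bins are of equal width; only the heights are rescaled.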

In R, the histogram function is called hist(). For this practical, however, you will use the truehist() function from the MASS package, because it offers a wider range of options for drawing histograms. MASS should be distributed as a recommended package in your R distribution.

> require(MASS)
[1] TRUE
> oldpar <- par(mfrow = c(3, 1))
> hist(Na, col = "grey", main = "R default", ylab = "Frequency",
+     freq = TRUE)
> truehist(Na, nbins = "FD", col = "grey", main = "Freedman-Diaconis rule",
+     ylab = "Frequency", prob = FALSE)
> truehist(Na, nbins = "scott", col = "grey", main = "Scott's rule",
+     ylab = "Frequency", prob = FALSE)
> par(oldpar)
> rm(oldpar)

Note the use of the par() function above to split the display into three rows and one column using the mfrow argument. By assigning the current contents of par() to oldpar and then setting mfrow = c(3,1) we can reset the original par() settings using par(oldpar) when we are finished. The figure produced should look similar to Figure 2. For sodium (Na), Scott's rule and the default setting for calculating the number of bins produce the same result. The Freedman-Diaconis rule, however, produces a histogram with a greater number of bins. Note also that we have restricted this example to frequency histograms (freq = TRUE in hist() and prob = FALSE in truehist(), whose default is to draw densities). In the next section probability density histograms will be required.

The recommended number of bins using the Freedman-Diaconis rule is generated using:

$$\left\lceil \frac{n^{1/3}\,(\max - \min)}{2\,(Q_3 - Q_1)} \right\rceil ,$$

where $n$ is the number of observations, $\max - \min$ is the range of the data and $Q_3 - Q_1$ is the interquartile range. The brackets $\lceil\ \rceil$ represent the ceiling, which indicates that you round up to the next integer (so you don't end up with 5.7 bins, for example!).

Figure 2: Histograms illustrating the use of different methods for calculating the number of bins. [Panels: R default; Freedman-Diaconis rule; Scott's rule. y axis: Frequency.]

Figure 3: Comparison between histogram and density estimation techniques. [Panel: Freedman-Diaconis rule. y axis: Density.]

5.3 Density estimation

A useful alternative to histograms is nonparametric density estimation, which results in a smoothing of the histogram. The kernel-density estimate at the value $x$ of a variable $X$ is given by:

$$\hat{f}(x) = \frac{1}{nb} \sum_{j=1}^{n} K\!\left(\frac{x - x_j}{b}\right),$$

where $x_j$ are the $n$ observations of $X$, $K$ is a kernel function (such as the normal density) and $b$ is a bandwidth parameter controlling the degree of smoothing. Small bandwidths produce rough density estimates whilst large bandwidths produce smoother estimates.

It is easier to illustrate the features of density estimation than to explain them mathematically. The code snippet below draws a histogram and then overlays two density estimates using different bandwidths. The code also illustrates the use of the lines() function to build up successive layers of graphics on a single plot.

> truehist(Na, nbins = "FD", col = "grey", prob = TRUE,
+     ylab = "Density", main = "Freedman-Diaconis rule")
> lines(density(Na), lwd = 2)
> lines(density(Na, adjust = 0.5), lwd = 1)
> rug(Na)
> box()

Note the use of the argument prob = TRUE in the truehist() call above, which draws a histogram so that it is scaled similarly to the density estimates. You should see something similar to Figure 3.

Examine the probability density estimates for some of the other SCP surface chemistry variables.

5.4 Quantile-quantile plots

Quantile-quantile or Q-Q plots are a useful tool for determining whether your data are normally distributed or not. Q-Q plots are produced using the function qqnorm() and its counterpart qqline(). Q-Q plots illustrate the relationship between the distribution of a variable and a reference or theoretical distribution. We will confine ourselves to using the normal distribution as our reference distribution, though other R functions can be used to compare data that conform to other distributions. A Q-Q plot graphs the relationship between our ordered data and the corresponding quantiles of the reference distribution. To illustrate this, type in the following code snippet:

> qqnorm(Na)
> qqline(Na)

Figure 4: A quantile-quantile plot of Sodium from the SCP surface chemistry data set. [Title: Normal Q-Q Plot. Axes: Theoretical Quantiles (x), Sample Quantiles (y).]

If the data are normally distributed they should plot on a straight line passing through the 1st and 3rd quartiles. The line added using qqline() aids the interpretation of this plot. Where there is a break in slope of the plotted points, the data deviate from the reference distribution. From the example above, we can see that the data are reasonably normally distributed but have a slightly longer left tail (Figure 4).

Plot Q-Q plots for some of the chemical variables. Suggest which of the variables are normally distributed and which variables are right- or left-skewed.

5.5 Boxplots

Boxplots are another useful tool for displaying the properties of numerical data and, as we will see in a minute, are useful for comparing the distribution of a variable across two or more groups. Boxplots are also known as box-whisker plots on account of their appearance. Boxplots are produced using the boxplot() function. Here, many of the enhanced features of boxplots are illustrated:

> boxplot(Na, notch = TRUE, col = "grey", ylab = "Na",
+     main = "Boxplot of Sodium", boxwex = 0.5)

Figure 5 shows the resulting boxplot. The box is drawn between the 1st and 3rd quartiles, with the median being represented by the horizontal line running through the box. The notches either side of the median are used to compare boxplots between groups. If the notches of any two boxplots do not overlap then the median values of a variable for those two groups are significantly different at the 95% level. When plotting only a few boxplots per diagram, the boxwex argument can be used to control the width of the box, which often results in a better plot. Here we scale the box width by 50%.

The boxplot for Sodium (Na) in Figure 5 again illustrates that these data are almost normally distributed: the whiskers are about the same length, with only a few observations at the lower end extending past the whiskers.

As mentioned earlier, boxplots are particularly useful for comparing the properties of a variable among two or more groups. The boxplot() function allows us to specify the grouping variable and the variable to be plotted using a standard formula notation. We will see more of this in the Regression practical. Try the following code snippet:

> boxplot(Na ~ FuelType, notch = TRUE, col = "grey",
+     ylab = "Na", main = "Boxplot of Sodium surface chemistry by fuel type",
+     boxwex = 0.5, varwidth = TRUE)

Figure 5: A boxplot of Sodium from the SCP surface chemistry data set. [Title: Boxplot of Sodium.]

Figure 6: A boxplot of Sodium from the SCP surface chemistry data set by fuel type. [Title: Boxplot of Sodium surface chemistry by Fuel Type. Groups: Coal, Oil.]

Figure 6 shows the resulting plot. The formula notation takes the form:

variable ~ grouping variable

The only new argument used above is varwidth = TRUE, which draws the boxes with widths proportional to the square root of the number of observations in each group. This can visually illustrate differences in sample size among groups; differences in spread show up in the heights of the boxes and the lengths of the whiskers.

Are the median values for Na split by FuelType significantly different at the 95% level? Plot boxplots for some of the other variables by FuelType. Suggest which variables show different distributions between fuel types.

Before we move on to bivariate and multivariate graphical EDA, let's look at an extended example that illustrates the methods we learned above and demonstrates how to put together a composite figure using R's plotting functions. Type the entire code snippet into the R console:

> oldpar <- par(mfrow = c(2, 2), oma = c(0, 0, 2, 0))
> truehist(Na, nbins = "FD", col = "gray", main = "Histogram of Na",
+     ylab = "Frequency", prob = FALSE)
> box()
> plot(density(Na), lwd = 1, main = "Kernel density estimation")
> rug(Na)
> qqnorm(Na)
> qqline(Na)
> boxplot(Na, notch = TRUE, col = "gray", ylab = "Na",
+     main = "Boxplot of Sodium")
> mtext("Graphical Exploratory Data Analysis for Sodium",
+     font = 4, outer = TRUE)
> par(oldpar)
> rm(oldpar)

Figure 7: Bringing together the univariate descriptive statistical tools learned so far. [Overall title: Graphical Exploratory Data Analysis for Sodium. Panels: Histogram of Na; Kernel density estimation; Normal Q-Q Plot; Boxplot of Sodium.]

The resulting plot is shown in Figure 7. Most of the code you just used should be familiar to you by now. Each plotting region has a margin, determined by the mar argument to par(). Because we split the plotting device into 4 plotting regions, each of these regions has a margin for the axis annotation. To get an outer margin we need to set the oma parameter. The oma = c(0, 0, 2, 0) argument to par() sets up an outer margin around the entire plot, of 0 lines on the bottom and both sides, and 2 lines on the top. The next seven commands draw the individual plots, and then the last new function, mtext(), is used to add a title to the outer plot margin (denoted by the outer = TRUE argument) in a bold italic font (denoted by the font = 4 argument).

6 Bivariate and multivariate graphical data analysis

This section of the practical will demonstrate the graphical abilities of R for exploratory data analysis of bivariate and multivariate data. The standard R plotting function is plot(). plot() is a generic function and works in different ways depending upon what type of data is passed as an argument to it.

6.1 Scatter plots

The simplest bivariate plot is a scatter plot:

> plot(x = Na, y = S, main = "Scatter plot of sodium against sulphur")

The x and y variables are passed as arguments to plot() and we define a main title for the plot. There are lots of embellishments one can use to alter the appearance of the plot, but one of the most useful is a scatter plot smoother, which should highlight the relationship between the two plotted variables.

> plot(x = Na, y = S, main = "Scatter plot of sodium against sulphur")
> lines(lowess(x = Na, y = S), col = "red", lwd = 1)

Here we just redraw the original plot for clarity; technically you only need to enter the lines() command if the scatter plot is still visible in the plotting device and was the last plot made. The resulting plot is shown in Figure 8. To understand how this works, type the following at the R prompt (see footnote 3):

> lowess(x = Na, y = S)

The lowess() function takes as its arguments (in its simplest form) two vectors, x and y, corresponding to the two variables of interest, and returns an object with two vectors of points corresponding to the smoothed value fitted at each pair of x and y. The lines() function expects to be passed pairs of coordinates (x and y), and in our example draws a line of width 1 (lwd = 1, which isn't strictly necessary as the default is to use whatever lwd is set to in par()) in red (col = "red"). So you can see that we do not need to save the results of lowess() in order to plot them. This is a very useful feature of R, being able to nest function calls inside one another, and will be used more frequently throughout the remaining practical classes.

Footnote 3: Note that your output will look slightly different depending on the width of your console window.
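The structure returned by lowess(), and the equivalence of the nested and two-step forms, can be inspected on any pair of numeric vectors. The sketch below uses simulated data (the vectors x and y and the object sm are invented for illustration and are not the SCP variables).

```r
set.seed(42)
x <- runif(50, min = 0, max = 10)        # simulated predictor
y <- sin(x) + rnorm(50, sd = 0.3)        # simulated noisy response

# Two-step form: save the smoother, inspect it, then draw it.
sm <- lowess(x = x, y = y)
str(sm)                                  # a list with components x and y

plot(x, y, main = "Simulated data with a LOWESS smoother")
lines(sm, col = "red")

# Nested form: the result of lowess() is passed straight to lines().
lines(lowess(x = x, y = y), col = "red")
```

The x component is returned in increasing order, which is why lines() can join the fitted points without the line doubling back on itself.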

Figure 8: A simple scatter plot of sodium against sulphur with a LOWESS smoother drawn through the data. [Title: Scatter plot of sodium against sulphur. y axis: S.]

6.2 Coded scatter plots and other enhancements

In the previous section the use of the plot() function to draw simple scatter plots was demonstrated. Now we illustrate a few additional R functions that can be used to make better-looking and more informative plots. Try the following:

> plot(x = Na, y = S, pch = c(21, 3)[FuelType], col = c("red",
+     "blue")[FuelType], main = "Scatter plot of sodium against sulphur")
> lines(lowess(x = Na[FuelType == "Coal"], y = S[FuelType ==
+     "Coal"]), col = "red", lwd = 1)
> lines(lowess(x = Na[FuelType == "Oil"], y = S[FuelType ==
+     "Oil"]), col = "blue", lwd = 1)
> legend("topleft", legend = c("Coal", "Oil"), col = c("red",
+     "blue"), pch = c(21, 3))

The resulting plot is shown in Figure 9. The first command should be mostly familiar to you by now. We have specified the plotting characters (using pch = c(21,3)). What is special about this, though, is that we use the FuelType variable to determine which character is plotted. This works because the c() function creates a vector from its arguments. It is possible to subset a vector in much the same way as you did earlier for a data frame, but because a vector has only a single dimension, we only need to specify which entry in the vector we want. FuelType is a factor and evaluates to 1 or 2 when used in this way, depending on the level of each corresponding observation. Which character is plotted is therefore determined by whether FuelType is Coal or Oil. We use the same notation to set the plotting colour to "red" or "blue".
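The way a factor selects plotting characters and colours can be seen without drawing anything. In this minimal sketch the four-element factor fuel is invented for illustration; it stands in for FuelType.

```r
# A hypothetical stand-in for FuelType with four observations.
fuel <- factor(c("Coal", "Oil", "Oil", "Coal"))

# Internally each observation is stored as the integer code of its level.
levels(fuel)                 # "Coal" "Oil"
as.integer(fuel)             # 1 2 2 1

# Indexing a vector with a factor uses those integer codes, giving one
# plotting character and one colour per observation.
c(21, 3)[fuel]               # 21 3 3 21
c("red", "blue")[fuel]       # "red" "blue" "blue" "red"
```

The result has one entry per observation, which is the form the pch and col arguments of plot() expect.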

Figure 9: A coded scatter plot of sodium against sulphur by fuel type. A LOWESS smoother is drawn through the data points for each fuel type. [Title: Scatter plot of sodium against sulphur. y axis: S. Legend: Coal, Oil.]

A similar method is used to plot the LOWESS scatter plot smoothers for each fuel type. This time, however, we only want to fit a LOWESS to a single fuel type at a time, so we select values from Na and S where the corresponding FuelType is either Coal or Oil. We have to do this twice, once for each level in FuelType (see footnote 4).

The final command makes use of the legend() function to create a legend for the plot. The first argument to legend() can be a vector of coordinates for the upper left corner of the box that will contain the legend, given in terms of the scale of the current plot region (the locator() function can be used to set these interactively). Here we instead use the keyword "topleft" to place the legend in the top left corner of the plot.

Draw coded scatter plots in the same way for other variables in the SCP data set. Which variables seem to best discriminate between the two fuel types?

6.3 Scatter plot matrices

When there are more than two variables of interest it is useful to plot a scatter plot matrix, which plots each variable against every other one. The pairs() function is the standard R graphics function for plotting scatter plot matrices (see footnote 5). Enter the following code snippet and look at the resulting plot:

Footnote 4: There are ways of automating this in R code, but this involves writing a custom function using a loop and so is beyond the scope of this practical.

Footnote 5: I mentioned earlier that plot() was a generic function. If you were to pass plot() a data frame containing three or more variables, plot() would call pairs() to perform the plot instead of, say, plot.default(), which was used behind the scenes to produce the plots in Figure 9.

Figure 10: A scatter plot matrix showing the relationship between Sodium, Nickel, Sulphur and Iron. A LOWESS smoother is plotted through each individual plot.

Figure 11: A scatter plot matrix showing the relationship between Sodium, Nickel, Sulphur and Iron. Red circles denote coal-fired power stations and blue crosses oil-fired ones.

> pairs(cbind(Na, Ni, S, Fe), gap = 0, panel = panel.smooth)

The plot generated is shown in Figure 10. Let's take a moment to explain the above functions. Firstly, the cbind() function (column bind) has been used to stitch together the four named vectors (variables) into a matrix. cbind() works in a similar way to c(), but instead of creating a vector from its arguments, cbind() takes vectors (and other R objects) and creates a matrix or data frame from them. We have done this because pairs() needs to work on a matrix or data frame of variables. The final argument in the call to pairs() is the panel argument. The details of panel are beyond the scope of this practical, but suffice it to say that panel is used to plot any appropriate function on the upper and lower panels of the scatter plot matrix (the individual scatter plots). Later in the practical you will use a custom panel function to add features to the diagonal panel, where currently only the variable label is displayed. In the above code snippet we make use of an existing panel function, panel.smooth, which plots a LOWESS smoother through each of the panels using the default span (bandwidth; see footnote 6).

As an alternative to Figure 10, we can plot coded scatter plot matrices like so:

> pairs(cbind(Na, Ni, S, Fe), pch = c(21, 3)[FuelType],
+     col = c("red", "blue")[FuelType], gap = 0)

Footnote 6: You will learn more about the bandwidth parameter in lowess() in the practical on Advanced Regression techniques.

This example should be much more familiar to you than the previous one, as the structure of the commands is similar to that of the single coded scatter plot earlier. The same syntax is used to select a plotting character and colour for each fuel type. The resulting plot is shown in Figure 11.

Plot different combinations of the variables in the SCP data set in scatter plot matrices. Which variables seem to best discriminate between the two fuel types?

7 Univariate statistical tests

Ward and Quinn (1988) collected 37 egg capsules of the intertidal predatory gastropod Lepsiella vinosa from the litorinid zone on a rocky intertidal shore and 42 capsules from the mussel zone. Other data indicated that rates of energy consumption by L. vinosa were much greater in the mussel zone, so there was interest in differences of fecundity between the mussel zone and the litorinid zone.

Our null hypothesis, H0, is that there is no difference between the zones in the mean number of eggs per capsule. This is an independent comparison because individual egg capsules can only be in one of the two zones. As such, a two-sample t test is appropriate to test for a difference in mean number of eggs per capsule between the zones.

Begin by reading in the gastropod data, display the number of egg capsules collected in each zone and finally calculate the mean eggs per capsule of the two zones:

> gastropod <- read.csv("gastropod.csv")
> table(gastropod$zone)

Littor Mussel
    37     42
> with(gastropod, aggregate(eggs, list(zone), mean))
[Output: a table with one row each for Littor and Mussel (column Group.1) and the mean number of eggs per capsule (column x).]

As always, it is a good idea to plot the data before attempting any further analysis. A boxplot is appropriate in this case.

> boxplot(eggs ~ zone, data = gastropod, varwidth = TRUE,
+     notch = TRUE)

With varwidth = TRUE the widths of the boxes reflect the number of observations in each group (37 and 42 here, so the widths are quite similar); compare the heights of the boxes and the lengths of the whiskers to judge whether the two groups have similar variance. The notches do not overlap, suggesting that the medians differ and hinting at a significant difference in the mean eggs per capsule of the two zones.

We can test this formally, using the t.test() function, which can be used to perform a variety of t tests.
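As an illustration of that variety, the sketch below runs both the default Welch test and the classical pooled-variance test on simulated data, and recomputes the Welch t statistic by hand. The two groups g1 and g2 are invented (only their sizes, 37 and 42, echo the gastropod study); the means and standard deviations used to simulate them are arbitrary assumptions.

```r
set.seed(7)
g1 <- rnorm(37, mean = 9,  sd = 2)   # simulated group 1 (values invented)
g2 <- rnorm(42, mean = 11, sd = 2)   # simulated group 2 (values invented)

# Default: Welch test, which does not assume equal variances.
welch <- t.test(g1, g2)

# Classical two-sample t test with a pooled variance estimate.
pooled <- t.test(g1, g2, var.equal = TRUE)
pooled$parameter                     # degrees of freedom: 37 + 42 - 2

# The Welch t statistic by hand: difference in means divided by the
# standard error built from the two separate sample variances.
se <- sqrt(var(g1) / length(g1) + var(g2) / length(g2))
t_by_hand <- (mean(g1) - mean(g2)) / se
all.equal(unname(welch$statistic), t_by_hand)
```

The two tests share the same numerator; they differ in how the standard error and the degrees of freedom are calculated, which matters most when the group variances or sizes are unequal.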

> egg.ttest <- t.test(eggs ~ zone, data = gastropod)
> egg.ttest

        Welch Two Sample t-test

data:  eggs by zone
t = , df = , p-value = 6.192e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:

sample estimates:
mean in group Littor mean in group Mussel

Look at the output from the function. Identify the t statistic and the number of degrees of freedom for the test. Is there a significant difference between the mean number of eggs per capsule from the two zones? The 95 per cent confidence interval is for the difference in means, and does not include 0, suggesting a significant difference in the means of the two zones.

8 Summing up

In this practical you have seen how R can be used to perform a range of exploratory data analysis tasks and simple statistical tests. You have also seen how to produce publication quality graphics using R commands.


More information

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential

More information

Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

More information

5 Correlation and Data Exploration

5 Correlation and Data Exploration 5 Correlation and Data Exploration Correlation In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both

More information

Data exploration with Microsoft Excel: univariate analysis

Data exploration with Microsoft Excel: univariate analysis Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating

More information

Variables. Exploratory Data Analysis

Variables. Exploratory Data Analysis Exploratory Data Analysis Exploratory Data Analysis involves both graphical displays of data and numerical summaries of data. A common situation is for a data set to be represented as a matrix. There is

More information

Introduction Course in SPSS - Evening 1

Introduction Course in SPSS - Evening 1 ETH Zürich Seminar für Statistik Introduction Course in SPSS - Evening 1 Seminar für Statistik, ETH Zürich All data used during the course can be downloaded from the following ftp server: ftp://stat.ethz.ch/u/sfs/spsskurs/

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI STATS8: Introduction to Biostatistics Data Exploration Babak Shahbaba Department of Statistics, UCI Introduction After clearly defining the scientific problem, selecting a set of representative members

More information

4 Other useful features on the course web page. 5 Accessing SAS

4 Other useful features on the course web page. 5 Accessing SAS 1 Using SAS outside of ITCs Statistical Methods and Computing, 22S:30/105 Instructor: Cowles Lab 1 Jan 31, 2014 You can access SAS from off campus by using the ITC Virtual Desktop Go to https://virtualdesktopuiowaedu

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

Getting started with qplot

Getting started with qplot Chapter 2 Getting started with qplot 2.1 Introduction In this chapter, you will learn to make a wide variety of plots with your first ggplot2 function, qplot(), short for quick plot. qplot makes it easy

More information

Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16

Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16 Summary of R software commands used to generate bootstrap and permutation test output and figures in Chapter 16 Since R is command line driven and the primary software of Chapter 16, this document details

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Each function call carries out a single task associated with drawing the graph.

Each function call carries out a single task associated with drawing the graph. Chapter 3 Graphics with R 3.1 Low-Level Graphics R has extensive facilities for producing graphs. There are both low- and high-level graphics facilities. The low-level graphics facilities provide basic

More information

Statistical Models in R

Statistical Models in R Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

An introduction to using Microsoft Excel for quantitative data analysis

An introduction to using Microsoft Excel for quantitative data analysis Contents An introduction to using Microsoft Excel for quantitative data analysis 1 Introduction... 1 2 Why use Excel?... 2 3 Quantitative data analysis tools in Excel... 3 4 Entering your data... 6 5 Preparing

More information

Dongfeng Li. Autumn 2010

Dongfeng Li. Autumn 2010 Autumn 2010 Chapter Contents Some statistics background; ; Comparing means and proportions; variance. Students should master the basic concepts, descriptive statistics measures and graphs, basic hypothesis

More information

DATA INTERPRETATION AND STATISTICS

DATA INTERPRETATION AND STATISTICS PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

More information

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com www.excelmasterseries.com

More information

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS SECTION 2-1: OVERVIEW Chapter 2 Describing, Exploring and Comparing Data 19 In this chapter, we will use the capabilities of Excel to help us look more carefully at sets of data. We can do this by re-organizing

More information

Getting Started with R and RStudio 1

Getting Started with R and RStudio 1 Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following

More information

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-110012 1. INTRODUCTION R is a free software environment for statistical computing

More information

Lecture 2: Exploratory Data Analysis with R

Lecture 2: Exploratory Data Analysis with R Lecture 2: Exploratory Data Analysis with R Last Time: 1. Introduction: Why use R? / Syllabus 2. R as calculator 3. Import/Export of datasets 4. Data structures 5. Getting help, adding packages 6. Homework

More information

determining relationships among the explanatory variables, and

determining relationships among the explanatory variables, and Chapter 4 Exploratory Data Analysis A first look at the data. As mentioned in Chapter 1, exploratory data analysis or EDA is a critical first step in analyzing the data from an experiment. Here are the

More information

Introduction to Exploratory Data Analysis

Introduction to Exploratory Data Analysis Introduction to Exploratory Data Analysis A SpaceStat Software Tutorial Copyright 2013, BioMedware, Inc. (www.biomedware.com). All rights reserved. SpaceStat and BioMedware are trademarks of BioMedware,

More information

Final Exam Practice Problem Answers

Final Exam Practice Problem Answers Final Exam Practice Problem Answers The following data set consists of data gathered from 77 popular breakfast cereals. The variables in the data set are as follows: Brand: The brand name of the cereal

More information

Analysis of System Performance IN2072 Chapter M Matlab Tutorial

Analysis of System Performance IN2072 Chapter M Matlab Tutorial Chair for Network Architectures and Services Prof. Carle Department of Computer Science TU München Analysis of System Performance IN2072 Chapter M Matlab Tutorial Dr. Alexander Klein Prof. Dr.-Ing. Georg

More information

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

There are six different windows that can be opened when using SPSS. The following will give a description of each of them. SPSS Basics Tutorial 1: SPSS Windows There are six different windows that can be opened when using SPSS. The following will give a description of each of them. The Data Editor The Data Editor is a spreadsheet

More information

2. Filling Data Gaps, Data validation & Descriptive Statistics

2. Filling Data Gaps, Data validation & Descriptive Statistics 2. Filling Data Gaps, Data validation & Descriptive Statistics Dr. Prasad Modak Background Data collected from field may suffer from these problems Data may contain gaps ( = no readings during this period)

More information

Statistics Chapter 2

Statistics Chapter 2 Statistics Chapter 2 Frequency Tables A frequency table organizes quantitative data. partitions data into classes (intervals). shows how many data values are in each class. Test Score Number of Students

More information

How To Write A Data Analysis

How To Write A Data Analysis Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction

More information

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

More information

Cluster Analysis using R

Cluster Analysis using R Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other

More information

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol R Graphics Cookbook Winston Chang Beijing Cambridge Farnham Koln Sebastopol O'REILLY Tokyo Table of Contents Preface ix 1. R Basics 1 1.1. Installing a Package 1 1.2. Loading a Package 2 1.3. Loading a

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences.

Once saved, if the file was zipped you will need to unzip it. For the files that I will be posting you need to change the preferences. 1 Commands in JMP and Statcrunch Below are a set of commands in JMP and Statcrunch which facilitate a basic statistical analysis. The first part concerns commands in JMP, the second part is for analysis

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

Chapter 2 Exploratory Data Analysis

Chapter 2 Exploratory Data Analysis Chapter 2 Exploratory Data Analysis 2.1 Objectives Nowadays, most ecological research is done with hypothesis testing and modelling in mind. However, Exploratory Data Analysis (EDA), which uses visualization

More information

Foundation of Quantitative Data Analysis

Foundation of Quantitative Data Analysis Foundation of Quantitative Data Analysis Part 1: Data manipulation and descriptive statistics with SPSS/Excel HSRS #10 - October 17, 2013 Reference : A. Aczel, Complete Business Statistics. Chapters 1

More information

Chapter 7 Section 7.1: Inference for the Mean of a Population

Chapter 7 Section 7.1: Inference for the Mean of a Population Chapter 7 Section 7.1: Inference for the Mean of a Population Now let s look at a similar situation Take an SRS of size n Normal Population : N(, ). Both and are unknown parameters. Unlike what we used

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

GeoGebra Statistics and Probability

GeoGebra Statistics and Probability GeoGebra Statistics and Probability Project Maths Development Team 2013 www.projectmaths.ie Page 1 of 24 Index Activity Topic Page 1 Introduction GeoGebra Statistics 3 2 To calculate the Sum, Mean, Count,

More information

Psychology 205: Research Methods in Psychology

Psychology 205: Research Methods in Psychology Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready

More information

Figure 1. An embedded chart on a worksheet.

Figure 1. An embedded chart on a worksheet. 8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the

More information

Lecture 1: Review and Exploratory Data Analysis (EDA)

Lecture 1: Review and Exploratory Data Analysis (EDA) Lecture 1: Review and Exploratory Data Analysis (EDA) Sandy Eckel seckel@jhsph.edu Department of Biostatistics, The Johns Hopkins University, Baltimore USA 21 April 2008 1 / 40 Course Information I Course

More information

Using Excel for descriptive statistics

Using Excel for descriptive statistics FACT SHEET Using Excel for descriptive statistics Introduction Biologists no longer routinely plot graphs by hand or rely on calculators to carry out difficult and tedious statistical calculations. These

More information

Using R for Linear Regression

Using R for Linear Regression Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional

More information

sample median Sample quartiles sample deciles sample quantiles sample percentiles Exercise 1 five number summary # Create and view a sorted

sample median Sample quartiles sample deciles sample quantiles sample percentiles Exercise 1 five number summary # Create and view a sorted Sample uartiles We have seen that the sample median of a data set {x 1, x, x,, x n }, sorted in increasing order, is a value that divides it in such a way, that exactly half (i.e., 50%) of the sample observations

More information

How To Test For Significance On A Data Set

How To Test For Significance On A Data Set Non-Parametric Univariate Tests: 1 Sample Sign Test 1 1 SAMPLE SIGN TEST A non-parametric equivalent of the 1 SAMPLE T-TEST. ASSUMPTIONS: Data is non-normally distributed, even after log transforming.

More information

CHARTS AND GRAPHS INTRODUCTION USING SPSS TO DRAW GRAPHS SPSS GRAPH OPTIONS CAG08

CHARTS AND GRAPHS INTRODUCTION USING SPSS TO DRAW GRAPHS SPSS GRAPH OPTIONS CAG08 CHARTS AND GRAPHS INTRODUCTION SPSS and Excel each contain a number of options for producing what are sometimes known as business graphics - i.e. statistical charts and diagrams. This handout explores

More information

Northumberland Knowledge

Northumberland Knowledge Northumberland Knowledge Know Guide How to Analyse Data - November 2012 - This page has been left blank 2 About this guide The Know Guides are a suite of documents that provide useful information about

More information

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Online Learning Centre Technology Step-by-Step - Excel Microsoft Excel is a spreadsheet software application

More information

Systat: Statistical Visualization Software

Systat: Statistical Visualization Software Systat: Statistical Visualization Software Hilary R. Hafner Jennifer L. DeWinter Steven G. Brown Theresa E. O Brien Sonoma Technology, Inc. Petaluma, CA Presented in Toledo, OH October 28, 2011 STI-910019-3946

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Data Visualization in R

Data Visualization in R Data Visualization in R L. Torgo ltorgo@fc.up.pt Faculdade de Ciências / LIAAD-INESC TEC, LA Universidade do Porto Oct, 2014 Introduction Motivation for Data Visualization Humans are outstanding at detecting

More information

Assignment #03: Time Management with Excel

Assignment #03: Time Management with Excel Technical Module I Demonstrator: Dereatha Cross dac4303@ksu.edu Assignment #03: Time Management with Excel Introduction Success in any endeavor depends upon time management. One of the optional exercises

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Projects Involving Statistics (& SPSS)

Projects Involving Statistics (& SPSS) Projects Involving Statistics (& SPSS) Academic Skills Advice Starting a project which involves using statistics can feel confusing as there seems to be many different things you can do (charts, graphs,

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

Chapter 7 Section 1 Homework Set A

Chapter 7 Section 1 Homework Set A Chapter 7 Section 1 Homework Set A 7.15 Finding the critical value t *. What critical value t * from Table D (use software, go to the web and type t distribution applet) should be used to calculate the

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is

More information

Introduction to StatsDirect, 11/05/2012 1

Introduction to StatsDirect, 11/05/2012 1 INTRODUCTION TO STATSDIRECT PART 1... 2 INTRODUCTION... 2 Why Use StatsDirect... 2 ACCESSING STATSDIRECT FOR WINDOWS XP... 4 DATA ENTRY... 5 Missing Data... 6 Opening an Excel Workbook... 6 Moving around

More information

Data analysis and regression in Stata

Data analysis and regression in Stata Data analysis and regression in Stata This handout shows how the weekly beer sales series might be analyzed with Stata (the software package now used for teaching stats at Kellogg), for purposes of comparing

More information

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96 1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

More information

An Introduction to Point Pattern Analysis using CrimeStat

An Introduction to Point Pattern Analysis using CrimeStat Introduction An Introduction to Point Pattern Analysis using CrimeStat Luc Anselin Spatial Analysis Laboratory Department of Agricultural and Consumer Economics University of Illinois, Urbana-Champaign

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.) Center: Finding the Median When we think of a typical value, we usually look for the center of the distribution. For a unimodal, symmetric distribution, it s easy to find the center it s just the center

More information

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi-110 012 seema@iasri.res.in Genomics A genome is an organism s

More information

Drawing a histogram using Excel

Drawing a histogram using Excel Drawing a histogram using Excel STEP 1: Examine the data to decide how many class intervals you need and what the class boundaries should be. (In an assignment you may be told what class boundaries to

More information

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS Mathematics Revision Guides Histograms, Cumulative Frequency and Box Plots Page 1 of 25 M.K. HOME TUITION Mathematics Revision Guides Level: GCSE Higher Tier HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc. STATGRAPHICS Online Statistical Analysis and Data Visualization System Revised 6/21/2012 Copyright 2012 by StatPoint Technologies, Inc. All rights reserved. Table of Contents Introduction... 1 Chapter

More information

Using SPSS, Chapter 2: Descriptive Statistics

Using SPSS, Chapter 2: Descriptive Statistics 1 Using SPSS, Chapter 2: Descriptive Statistics Chapters 2.1 & 2.2 Descriptive Statistics 2 Mean, Standard Deviation, Variance, Range, Minimum, Maximum 2 Mean, Median, Mode, Standard Deviation, Variance,

More information

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode

Iris Sample Data Set. Basic Visualization Techniques: Charts, Graphs and Maps. Summary Statistics. Frequency and Mode Iris Sample Data Set Basic Visualization Techniques: Charts, Graphs and Maps CS598 Information Visualization Spring 2010 Many of the exploratory data techniques are illustrated with the Iris Plant data

More information

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3

COM CO P 5318 Da t Da a t Explora Explor t a ion and Analysis y Chapte Chapt r e 3 COMP 5318 Data Exploration and Analysis Chapter 3 What is data exploration? A preliminary exploration of the data to better understand its characteristics. Key motivations of data exploration include Helping

More information

A Short Guide to R with RStudio

A Short Guide to R with RStudio Short Guides to Microeconometrics Fall 2013 Prof. Dr. Kurt Schmidheiny Universität Basel A Short Guide to R with RStudio 1 Introduction 2 2 Installing R and RStudio 2 3 The RStudio Environment 2 4 Additions

More information

AMS 7L LAB #2 Spring, 2009. Exploratory Data Analysis

AMS 7L LAB #2 Spring, 2009. Exploratory Data Analysis AMS 7L LAB #2 Spring, 2009 Exploratory Data Analysis Name: Lab Section: Instructions: The TAs/lab assistants are available to help you if you have any questions about this lab exercise. If you have any

More information

Common Tools for Displaying and Communicating Data for Process Improvement

Common Tools for Displaying and Communicating Data for Process Improvement Common Tools for Displaying and Communicating Data for Process Improvement Packet includes: Tool Use Page # Box and Whisker Plot Check Sheet Control Chart Histogram Pareto Diagram Run Chart Scatter Plot

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217

Part 3. Comparing Groups. Chapter 7 Comparing Paired Groups 189. Chapter 8 Comparing Two Independent Groups 217 Part 3 Comparing Groups Chapter 7 Comparing Paired Groups 189 Chapter 8 Comparing Two Independent Groups 217 Chapter 9 Comparing More Than Two Groups 257 188 Elementary Statistics Using SAS Chapter 7 Comparing

More information

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Calculate counts, means, and standard deviations Produce

More information