5 Correlation and Data Exploration

Correlation

In Unit 3, we did some correlation analyses of data from studies related to the acquisition order and acquisition difficulty of English morphemes by both children and adult learners of L2 English. We used a Spearman Rank Order Correlation Test to compare the orders of different groups of learners and found that there were statistically significant relationships (i.e. p < 0.05). We also used a Pearson Correlation Test to find out whether the morpheme acquisition difficulties were similar across groups of learners. The results were mixed: some showed statistically significant relationships but others did not.

Correlation tests tell us how much two variables vary together. Figure 1 shows scatterplots of pairs of variables with different correlation strengths (r = 0.90, r = 0.50 and r = 0.00), each with a regression line and its 95% confidence interval. (Regression is another statistical technique that is closely related to correlation. We shall look at it later.) The 95% confidence intervals show the range of regression lines that are plausible given the sample. The wider the interval, the less precise our regression line is likely to be.

Figure 1. Scatterplots of variables at different correlation coefficients (r) with 95% confidence intervals

In the scatterplot on the left (Figure 1), there is a very strong relationship (r = 0.90) between the variables. All the points are close to the regression line, and most of them are also in the bottom left and top right quadrants. The 95% confidence interval is also relatively narrow. In the
middle scatterplot, the relationship is not as strong (r = 0.50). The points are more spread out and further from the regression line, but most are still in the bottom left and top right quadrants. The 95% confidence interval is also wider. In the scatterplot on the right, there is no relationship between the variables (r = 0.00). The points are randomly scattered over the graph, and there are roughly the same number in each of the four quadrants. The regression line cannot be seen because it is now a horizontal line that goes through the mean of y. The 95% confidence interval is also the widest.

Data Exploration

Data exploration means looking at our data in detail so that we can find its characteristics. It is an essential step before carrying out any statistical tests, although researchers in the field of SLA often seem to forget it. The first step is often to calculate the descriptive statistics: the mean, the median, the minimum, the maximum, the range, the standard deviation, the 95% confidence interval of the mean, skewness, kurtosis and the standard error. Other data exploration techniques that are being used more and more are graphic techniques such as histograms, density plots, box plots, and scatterplots with regression lines and/or smoothed trend (loess or lowess) lines and confidence intervals.

Exploring the Critical Period Hypothesis (DeKeyser, 2000)

In our quick look at correlations (above), one of our assumptions was that the relationship between the two variables is linear (a straight line). However, the Critical Period Hypothesis (CPH) claims that the relationship between Age of Acquisition (AoA) and ultimate attainment is non-linear (see Unit 5, Figure 1). In this section, we will explore the data from one study that claimed to support the CPH and see if it suggests that the relationship is non-linear.
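Before turning to the graph, the descriptive statistics listed under Data Exploration can be sketched with a few base R functions. This is only an illustration and not part of the original unit: describe() is a hypothetical helper name, the vector scores holds the six GJT scores from the first rows of dekeyser.txt (the data file used later in this unit), and skewness and kurtosis are computed with simple moment formulas, which may differ slightly from the corrected versions used by some packages.

```r
# Hypothetical helper: descriptive statistics for a numeric vector.
describe <- function(x) {
  n     <- length(x)
  m     <- mean(x)
  s     <- sd(x)
  se    <- s / sqrt(n)                     # standard error of the mean
  tcrit <- qt(0.975, df = n - 1)           # critical t value for a 95% CI
  z     <- (x - m) / sqrt(mean((x - m)^2)) # standardised scores (population SD)
  c(mean = m, median = median(x), min = min(x), max = max(x),
    range = max(x) - min(x), sd = s, se = se,
    ci.lower = m - tcrit * se, ci.upper = m + tcrit * se,
    skewness = mean(z^3),                  # simple moment-based skewness
    kurtosis = mean(z^4) - 3)              # simple moment-based excess kurtosis
}

# The six GJT scores from the first rows of dekeyser.txt
scores <- c(170, 181, 198, 194, 196, 193)
stats <- describe(scores)
print(round(stats, 2))
```

With the full data frame, the same helper could be applied to a whole column, for example describe(dekeyser$GJT).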
We will begin by making a scatterplot of the data with a regression line and its 95% confidence interval, and a loess (smoothed trend) line and its 95% confidence interval. The graph will look like Figure 2. In order to make this graph, you will need to have the ggplot2 package installed in R. (If it is not installed, follow the instructions in Appendix A to install it, or, if you cannot install it, follow the instructions in Appendix B for creating it with the built-in plotting functions.)

The data you will need is in a file called dekeyser.txt. This file begins with a header, which contains the names of the variables (AoA, GJT, Status), and below them three columns of data. The columns (and variable names) are separated by an invisible tab character (or "\t" in R). The first few lines of the file look like this:
"AoA"   "GJT"   "Status"
8       170     "Under 15"
11      181     "Under 15"
9       198     "Under 15"
11      194     "Under 15"
13      196     "Under 15"
4       193     "Under 15"

Figure 2. Scatterplot of scores on a grammaticality judgement task (GJT) and age of acquisition (AoA) with regression and loess lines and their 95% confidence intervals produced using the ggplot function in the ggplot2 package (data from DeKeyser, 2000)

First, you will need to read the data into R and store it in a variable. There are several ways to do this but the following one is the most similar to other software. The command has several parts. dekeyser is the name of the data frame that you are going to store the data in (you can choose another name if you prefer). read.table() is the function that will actually read the data. file.choose() is another function that will start an open file dialogue box, similar to other programs in Windows and Mac. header = TRUE indicates that the first line of the file is a header and NOT data. The
final argument sep = "\t" indicates that the columns are separated by a tab character. Type the following (without "> ") and choose the file dekeyser.txt.

> dekeyser <- read.table(file.choose(), header = TRUE, sep = "\t")

If you get an error message, just try again. Now let's see what things look like. Type:

> head(dekeyser)
  AoA GJT   Status
1   8 170 Under 15
2  11 181 Under 15
3   9 198 Under 15
4  11 194 Under 15
5  13 196 Under 15
6   4 193 Under 15

You should see the first six lines of the data. The first row (AoA GJT Status) is your header. You can also see that R has added row numbers (1 2 3...) at the beginning of each row of data.

Now that the data has been imported, we can start to plot the graph. The first thing to do is to load the ggplot2 package. This is done with the library() function:

> library(ggplot2)

Next, we use the ggplot() function to plot the graph. ggplot() is a bit different from other functions we have used, and is made up of parts joined by a + symbol.

> ggplot(data = dekeyser, aes(AoA, GJT)) + geom_point() + geom_smooth(method = "lm") + geom_smooth(colour = "red")

The first part, ggplot(), initialises the plot but does not draw anything. In this example, it has two arguments. data = dekeyser tells ggplot to use the data frame called dekeyser, and aes(AoA, GJT) tells it to use the AoA and GJT variables (notice the order is x-axis, y-axis). The variable names must be typed exactly as they appear in the header, because R is case-sensitive. geom_point() plots the points. geom_smooth() draws lines (and their 95% confidence intervals) calculated from the data. geom_smooth(method = "lm") draws a straight regression line, which is specified by the argument method = "lm". geom_smooth(colour = "red") draws a red loess trend line on the graph. Notice that the method does not need to be specified for a loess trend line because loess is the default for small data sets (fewer than 1,000 observations).

When you have finished plotting, you can unload the package:

> detach(package:ggplot2)

Interpretation

What does the graph we have produced tell us about the data? Is there any evidence for a Critical Period?

Figure 3. Regression and loess lines with 95% confidence intervals of the DeKeyser (2000) data.

I think the most important things we need to look at are the regression line (blue) and the confidence intervals for the loess line. If the regression line goes outside the confidence intervals for the loess line, the relationship may not be linear, and there may be evidence for a Critical Period. In this case, we can see that it is outside the confidence intervals from about 20 to 24 years old. This, however, seems to be very late, as the Critical Period is assumed to
end at puberty, which is usually thought to be from 13 to 15 years old. In other words, this data does not appear to support the Critical Period Hypothesis. Of course, more sophisticated statistical techniques are needed to show whether this is likely to be true or not, but a visual analysis of the data can also be extremely helpful.

Assignments

Create similar graphs using the files dekeyserisr.txt, dekeyserus.txt and FlegeSimple.txt. Is there any evidence of a Critical Period?
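The visual check described under Interpretation (does the regression line leave the confidence band of the loess line, and at which ages?) can also be approximated numerically with base R. The following is a rough sketch under stated assumptions, not the method used to produce Figure 3: the data are simulated for illustration only (they are NOT the DeKeyser data), the names sim, grid and outside are made up, and the band is a pointwise interval built from the standard errors returned by predict().

```r
# Simulated data for illustration only (NOT the DeKeyser data):
# scores stay flat up to age 15 and then decline.
set.seed(1)
AoA <- rep(1:40, each = 3)
GJT <- 195 - 3 * pmax(AoA - 15, 0) + rnorm(length(AoA), sd = 5)
sim <- data.frame(AoA = AoA, GJT = GJT)

fit.lm <- lm(GJT ~ AoA, data = sim)     # straight regression line
fit.lo <- loess(GJT ~ AoA, data = sim)  # smoothed trend (loess) line

# Evaluate both fits on a grid inside the observed range of AoA
grid   <- data.frame(AoA = seq(min(sim$AoA), max(sim$AoA), by = 0.5))
lm.fit <- predict(fit.lm, newdata = grid)
lo     <- predict(fit.lo, newdata = grid, se = TRUE)

# Pointwise 95% confidence band for the loess line
lower <- lo$fit - qt(0.975, lo$df) * lo$se.fit
upper <- lo$fit + qt(0.975, lo$df) * lo$se.fit

# Ages at which the regression line lies outside the loess band
outside <- grid$AoA[lm.fit < lower | lm.fit > upper]
print(range(outside))
```

To try this on one of the assignment files, replace sim with the data frame you read in with read.table(). Like the graph, this check is exploratory and is not a formal test of non-linearity.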
Appendix A

Installing packages in R

This section shows you how to install packages in R. For instructions on installing R, refer to: http://cse.niaes.affrc.go.jp/miwa/ja/r/setupreasy/

The easiest way to install new packages in R is to use the menus. First, click Packages (パッケージ) and select Install package(s) (パッケージのインストール). A list of servers will appear (see below).
Next, select the server from which to download the package. Here, the default server (0-Cloud) has been selected. If you prefer, you may scroll down and select a server in Japan. After you do this, a list of packages appears. Scroll down this list until you find ggplot2. Select it and click OK. The package will be installed automatically.
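Packages can also be installed by typing a command in the R console instead of using the menus. The sketch below uses the standard functions requireNamespace() and install.packages(); the helper name ensure_package() is hypothetical and is only there to avoid reinstalling a package that is already available.

```r
# Hypothetical helper: install a package only if it is not already available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads the package from the selected server
  }
  # TRUE if the package can now be loaded, FALSE otherwise
  requireNamespace(pkg, quietly = TRUE)
}

# Usage (needs an internet connection if ggplot2 is not yet installed):
# ensure_package("ggplot2")
# library(ggplot2)
```

If R asks you to choose a server, pick one in the same way as in the menu-based instructions above.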
Appendix B

Plotting the data without the ggplot2 package

Using the built-in functions for plotting data (e.g., plot(), abline() and lines()) is more complicated than using functions in the ggplot2 package. The steps to produce a graph like Figure 2 are explained below. The result is shown in Figure 4.

Figure 4. Scatterplot of scores on a grammaticality judgement task (GJT) and age of acquisition (AoA) with regression and loess lines and their 95% confidence intervals

The data you will need is in a file called dekeyser.txt. This file begins with a header, which contains the names of the variables (AoA, GJT, Status), and below them three columns of data. The columns (and variable names) are separated by an invisible tab character (or "\t" in R). The first few lines of the file look like this:

"AoA"   "GJT"   "Status"
8       170     "Under 15"
11      181     "Under 15"
9       198     "Under 15"
11      194     "Under 15"
13      196     "Under 15"
4       193     "Under 15"

First, you will need to read the data into R and store it in a variable. There are several ways to do this but the following one is the most similar to other software. The command has several parts. dekeyser is the name of the variable that you are going to store the data in (you can choose another name if you prefer). read.table() is the function that will actually read the data. file.choose() is another function that will start an open file dialogue box, similar to other programs in Windows and Mac. header = TRUE indicates that the first line of the file is a header and NOT data. The final argument sep = "\t" indicates that the columns are separated by a tab character. Type the following (without the leading "> ") and choose the file dekeyser.txt.

> dekeyser <- read.table(file.choose(), header = TRUE, sep = "\t")

If you get an error message, just try again. Now let's see what things look like. Type:

> head(dekeyser)
  AoA GJT   Status
1   8 170 Under 15
2  11 181 Under 15
3   9 198 Under 15
4  11 194 Under 15
5  13 196 Under 15
6   4 193 Under 15

You should see the first six lines of the data. The first row (AoA GJT Status) is your header. You can also see that R has added row numbers (1 2 3...) at the beginning of each row of data.

The variable dekeyser is different from the vector variables that we used before. It is a data frame variable. However, in order to use the variables in it like vectors, type the following:

> attach(dekeyser)

Now we shall do some calculations that the graphics functions need in order to draw the lines. The first one, lm(), calculates the regression line for
GJT (y-axis) and AoA (x-axis) and stores it in dekeyser.lm. Variable names must be typed exactly as they appear in the header (GJT and AoA), because R is case-sensitive. [After you have done this, type dekeyser.lm to see what the regression line data looks like.]

> dekeyser.lm <- lm(GJT ~ AoA)

The next command stores a sequence of numbers in newx. [After you have done it, type newx to see what it looks like.]

> newx <- seq(0, 45, 0.1)

The next command, predict.lm(), calculates the predicted values of GJT and their confidence intervals and stores them in pred. Because AoA does not have many values, the 95% confidence lines may not be very smooth. newx is used instead of the original AoA values in order to make smoother lines. The column in newdata must be called AoA so that it matches the variable used in lm(). [Once again, you can type pred to see what this data looks like.]

> pred <- predict.lm(dekeyser.lm, newdata = data.frame(AoA = newx), interval = "confidence")

Now, we can start to plot the graph. First, we plot the points and the regression line and its 95% confidence intervals.

> plot(GJT ~ AoA, bty = "n", col = "grey", ylim = c(80, 210))
> abline(dekeyser.lm)
> lines(pred[, 2] ~ newx, lty = 2, col = "grey")
> lines(pred[, 3] ~ newx, lty = 2, col = "grey")

The next step is to do the calculations for the loess trend line and confidence intervals. The sequence is similar to that for the regression line but, because there are differences in the structure of pred (used for the regression line) and pred2, the arguments used for drawing the lines are different.

> dekeyser.lo <- loess(GJT ~ AoA)
> newx <- seq(0, 45, 0.1)
> pred2 <- predict(dekeyser.lo, newdata = data.frame(AoA = newx), se = TRUE)
> lines(pred2$fit ~ newx, col = "red4")
> lines(pred2$fit - qt(0.975, pred2$df) * pred2$se.fit ~ newx, lty = 2, col = "pink3")
> lines(pred2$fit + qt(0.975, pred2$df) * pred2$se.fit ~ newx, lty = 2, col = "pink3")
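The difference in structure between pred and pred2 mentioned above can be seen by inspecting the two objects. The sketch below uses simulated data for illustration only (not the DeKeyser data), and the names sim and new.df are made up. It shows that predict() for an lm fit with interval = "confidence" returns a matrix, whereas predict() for a loess fit with se = TRUE returns a list from which the interval has to be built by hand.

```r
# Simulated data for illustration only (NOT the DeKeyser data)
set.seed(42)
sim <- data.frame(AoA = rep(1:40, each = 2))
sim$GJT <- 200 - 1.5 * sim$AoA + rnorm(nrow(sim), sd = 8)

newx   <- seq(1, 40, 0.5)
new.df <- data.frame(AoA = newx)

# lm: predict() with interval = "confidence" returns a matrix
pred <- predict(lm(GJT ~ AoA, data = sim), newdata = new.df,
                interval = "confidence")
print(colnames(pred))  # columns: fit, lwr, upr

# loess: predict() with se = TRUE returns a list
pred2 <- predict(loess(GJT ~ AoA, data = sim), newdata = new.df, se = TRUE)
print(names(pred2))    # components: fit, se.fit, residual.scale, df
```

This is why the regression limits are drawn directly from pred[, 2] and pred[, 3], while the loess limits are calculated from the fitted values, the standard errors and the degrees of freedom.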
Summary of commands

> attach(dekeyser)
> dekeyser.lm <- lm(GJT ~ AoA)
> newx <- seq(0, 45, 0.1)
> pred <- predict.lm(dekeyser.lm, newdata = data.frame(AoA = newx), interval = "confidence")
> plot(GJT ~ AoA, bty = "n", col = "grey", ylim = c(80, 210))
> abline(dekeyser.lm)
> lines(pred[, 2] ~ newx, lty = 2, col = "grey")
> lines(pred[, 3] ~ newx, lty = 2, col = "grey")
> dekeyser.lo <- loess(GJT ~ AoA)
> newx <- seq(0, 45, 0.1)
> pred2 <- predict(dekeyser.lo, newdata = data.frame(AoA = newx), se = TRUE)
> lines(pred2$fit ~ newx, col = "red4")
> lines(pred2$fit - qt(0.975, pred2$df) * pred2$se.fit ~ newx, lty = 2, col = "pink3")
> lines(pred2$fit + qt(0.975, pred2$df) * pred2$se.fit ~ newx, lty = 2, col = "pink3")
> detach(dekeyser)