Exploratory data analysis. An introduction to R for the language sciences


R. H. Baayen
Interfaculty research unit for language and speech, University of Nijmegen, & Max Planck Institute for Psycholinguistics, Nijmegen
baayen@mpi.nl

course materials, Helsinki, version December 17, 2004

Introduction

This book provides an introduction to the statistical analysis of quantitative data for researchers studying aspects of language and language processing. For many of my colleagues, the statistical analysis of quantitative data is an onerous task that they would rather leave to others. They tend to use statistical packages as a kind of oracle from which you elicit a verdict as to whether you have one or more significant effects in your data. In order to elicit a response from the oracle, one has to click one's way through cascades of menus. After a magic button press, voluminous output tends to be produced that hides the p-values, the ultimate goal of the statistical pilgrimage, among lots of other numbers that are completely meaningless to the user, as befits a true oracle.

The approach to data analysis to which this book provides a guide is fundamentally different in several ways. First of all, we will make use of a radically different tool for doing statistics, the interactive programming environment known as R. R is an open source implementation of the S language and environment for data analysis originally developed at Bell Laboratories. Learning to work with R is in many ways similar to learning a new language. Once you have mastered its grammar, and once you have acquired some basic vocabulary, you will also have begun to acquire a new way of thinking about data analysis that is essential for understanding the structure in your data. The design of R is especially elegant in that it has a consistent, uniform syntax for specifying statistical models, no matter which type of model is being fitted.

What is essential about working with R, and this brings us to the second difference in our approach, is that we will depend heavily on visualization. R has outstanding graphical facilities, which generally provide far more insight into the data than long lists of statistics that depend on often questionable simplifying assumptions. That is, this book provides an introduction to exploratory data analysis.

Moreover, we will work incrementally and interactively. R is an object-oriented programming language. If you are not familiar with this term, you can think of R as a language in which a statistical model is an object, an object created by the researcher to capture the structure in the data. Once the object is created, there are many things you can do with that object. You can summarize the object in order to inspect parameters and their p-values, or you can plot the object in order to see how well it fits the data. Or you can update the object, or extract predictions from

the object, and so on. The process of understanding the structure in your data is almost always an iterative process involving graphical inspection, model building, graphical inspection, updating and adjusting the model, and so on. The flexibility of R is crucial for making this iterative process both easy and enjoyable.

A third, at first sight heretical, aspect of this book is that we have avoided all formal maths. The focus of this book is on explaining the key concepts and on providing guidelines for the proper use of statistical techniques. A useful metaphor is learning to drive a car. In order to drive a car, you need to know the position and function of tools such as the steering wheel and the brake. You also need to know that you should not drive with the hand brake on. And you need to know the traffic rules. Without these three kinds of knowledge, driving a car is extremely dangerous. What you do not need to know is how to construct a combustion engine, or how to drill for oil and refine it so that you can use it to fuel a combustion engine. The aim of this book is to provide you with a driving licence for exploratory data analysis. There is one caveat here. To stretch the metaphor to its limit: with R, you are receiving driving lessons in an all-powerful car, a combination of a racing car, a lorry, a personal vehicle, and a limousine. Consequently, you have to be a responsible driver, which means that you will often find that you need additional driving lessons beyond those offered in this book. Moreover, it never hurts to consult professional drivers: statisticians with a solid background in mathematical statistics who know the ins and outs of the tools and techniques, and their advantages and disadvantages.

Finally, the approach we have taken in this course is to work with real data sets rather than with small artificial examples. Real data are often messy, and it is important to know how to proceed when the data display all kinds of problems that standard introductory textbooks hardly ever mention.

An important reason for using R is that it is a carefully designed programming environment that allows you, in a very flexible way, to write your own code, or modify existing code, to tailor R to your specific needs. Moreover, you can call R from another program (e.g., from scripting languages such as Python, Perl, or AWK), so that you do not have to carry out many similar analyses by hand, one by one. To see why this is useful, consider a researcher studying similarities in meaning and form for a large number of words. Suppose that a separate model needs to be fitted for each of 1000 words to the data of the other 999 words. If you are used to thinking about statistical questions as paths through cascaded menus, you will discard such an analysis as impractical almost immediately. When you work in R, it is a piece of cake, because you can write the code for one word, and then cycle it through all the other words (a sketch of such a loop is given below). If all the data are available at once, you can do this within R. If the data become available one chunk at a time, and if the joint data set is too large to load into R all at once, you can call R from another program to analyse the separate chunks. We have seen many instances of researchers being limited in the questions they explored because they were thinking in a menu-driven language instead of in an interactive programming language like R. This is an area where language determines thought.
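To give a flavour of what such an analysis looks like, here is a minimal sketch of the loop involved. The names simdata, Word, and fit.one.word() are placeholders, not part of any actual data set or package: they stand for your own data frame, the column identifying the words, and whatever model-fitting code is appropriate for your study.

words.to.fit = unique(as.character(simdata$Word))
results = list()                                # one fitted model per word
for (w in words.to.fit) {
    others = simdata[simdata$Word != w, ]       # the data of all other words
    results[[w]] = fit.one.word(others)         # fit and store a model (placeholder)
}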

If you are new to working with a programming language, you will find that you have to get used to getting your commands for R exactly right. Every comma, apostrophe, and bracket is important, and a single mistyped character will cause R to stop and respond with an error message. There are command-line editing facilities, and you can page through earlier commands with the up and down arrows of your keyboard. It is often more useful, however, to open a simple text editor (emacs, gvim, notepad), to prepare your commands in the editor, and to copy and paste finished commands into the R window. Especially the more complex commands tend to be used more than once, and it is often much easier to make copies in the editor and modify these than to try to edit multi-line commands in the R window itself. Output from R that is worth remembering can be pasted back into the editor, which in this way retains a detailed history of both your commands and the relevant results.

There are several ways in which you can use this book. If you use this book as an introduction to statistics, it is important to work through the examples, not only by reading them, but by trying them out in R. Each chapter also comes with a set of problems; the solutions to these problems are provided at the end of the book. If you use this book to learn how to apply in R particular techniques that you are already familiar with, then the quickest way to proceed is to study the structure of the relevant data files used to illustrate the technique. Once you have understood how the data are to be formatted, you can load the data into R and try out the example. Once you have got this working, it should not be difficult to try out the same technique on your own data.

This book is organized as follows. The first chapter describes the basics of the language, simple data structures, loading data, and exploring the structure in the data using various visualization techniques. The second chapter provides an introduction to random variables, distributions, and standard statistical tests for single random variables as well as tests for two random variables. Chapter 3 is an overview of exploratory techniques for clustering and classification. Chapter 4 introduces multiple regression, including analysis of (co)variance and multilevel modeling.


Chapter 1

Calculating and plotting in R

In order to learn to work with R, you have to learn to speak its language. The grammar of the R language is beautiful and easy to learn. It is important to master the basics of R's grammar, as this grammar is designed to help you think about your data in a way that points towards how you might want to analyse them. In this chapter, we begin with very simple examples that show you how to talk with R. As soon as possible, however, we will begin to use examples from a large experimental data set.

When you start R, for instance by typing R at the prompt of your Linux console, R responds by providing you with its own prompt (>). It also checks whether there is a file named .RData in your current directory. If there is no such file, it creates one. If there already is such a file, indicating that you have worked on a problem in this directory before, it will load that file and make the objects stored in that file available to you. It is advisable to keep different projects in different directories, in order to avoid your workspace becoming cluttered with lots of unrelated objects. When you work with large data sets and complex objects in R, this is also the way to keep your .RData file from becoming unmanageably large.

The way to learn a language is to start speaking it. The way to learn R is to use it. Reading through the examples in this chapter is not enough to become a confident user of R. For this, you need to actually try out the examples, by typing them at the R prompt. You have to be very precise in your commands, which requires a discipline that you will only master if you learn from experience, and learn from your mistakes. Don't be put off if R complains about your initial attempts to use it; just carefully compare what you typed, letter by letter and bracket by bracket, with the code in the examples.

1.1 Calculating with R

1.1.1 Numbers and strings

Once you have an R window, you can use R as an (overgrown) calculator, as shown in the following examples:

> 1 + 2        # addition
[1] 3
> 2 * 3        # multiplication
[1] 6
> 6 / 3        # division
[1] 2
> 2^3          # power
[1] 8
> sqrt(9)      # square root
[1] 3
> sqrt(9)^3
[1] 27

Note, first of all, that R provides the answer as soon as you hit the return key. There is no need to supply the equal sign, and in fact you should not try to supply one. Second, the answers to all these examples are preceded by a [1]. We will return to why this is shortly. Third, the output of one expression, sqrt(9), can serve immediately as the input of a second expression, ^3.

You can save the output of any calculation in variables such as x or y by using the equal sign =, the assignment operator. You can also use the variables in calculations in the same way as numbers:

> x = 3        # assignment
> x            # request to display value
[1] 3
> y = sqrt(16) # assignment
> y            # another request to display value
[1] 4
> x ^ y        # working with two variables
[1] 81

If you type the name of a variable, say x, to the R prompt, it returns the corresponding value, 3 in this example. It is also possible to store sequences of letters, technically known as strings, in a variable:

> w = "word"
> w
[1] "word"

Note that strings are enclosed between double quotes.
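Incidentally, R also accepts the arrow <- as an assignment operator. You will encounter it frequently in other R code; at the prompt it can be used interchangeably with the = used in this book:

> x <- 3       # the same assignment as x = 3
> x
[1] 3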

1.1.2 Vectors

In most of your work with R, you will not be dealing with single numbers, or single strings, but with groups of numbers, or groups of strings. For instance, you might want to add 1 to each of the numbers 1, 2, 3, and 4. The way to do this in R is to first combine these numbers into an ordered list, a vector, by means of the combination function c():

> x = c(1, 2, 3, 4)   # combining numbers into a vector
> x
[1] 1 2 3 4

Now that we have combined the numbers 1, 2, 3 and 4 into a vector, we can add 1 to each element as follows:

> x + 1               # add one to each vector element
[1] 2 3 4 5
> y = x + 1           # store result in y
> y                   # display y
[1] 2 3 4 5

Vectors are very useful when the same calculations have to be carried out on many pairs of numbers:

> x = c(1, 2, 3, 4)
> y = c(5, 6, 7, 8)
> x + y
[1]  6  8 10 12
> x * y
[1]  5 12 21 32
> x^2
[1]  1  4  9 16
> x^2 * y
[1]   5  24  63 128

In these examples, you can conceptualize x and y as two columns of numbers. Calculations are performed on the numbers in these columns that are on the same row. (It will be helpful to think of vectors as columns rather than as rows for many tools in R.)

Vectors are fundamental to most of what you will be doing with R, and we will therefore discuss a number of ways for creating vectors, and for accessing the elements of a vector. The operator : creates an ascending or descending sequence of whole numbers (integers):

> 1:4
[1] 1 2 3 4
> 4:1
[1] 4 3 2 1
> c(1:4, 4:1)
[1] 1 2 3 4 4 3 2 1

A more flexible sequencing function is seq(), which gives you control over the increment:

> seq(1, 2, 0.1)
 [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
> seq(2, 0, -0.5)
[1] 2.0 1.5 1.0 0.5 0.0

The function rep() is useful for repeating single numbers but also vectors:

> rep(1, 4)
[1] 1 1 1 1
> rep(1:4, 4)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
> rep(1:4, 4:1)
 [1] 1 1 1 1 2 2 2 3 3 4
> rep(seq(1, 4, 0.5), 2)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Note that the second argument of rep() specifies the number of repetitions. If this second argument is itself a vector, it specifies the number of repetitions required for each of the elements of the first argument, which should then itself be a vector.

In the examples thus far, we have only examined vectors whose elements were numbers. However, vectors can also be constructed for strings, and the functions c() and rep() work just as before:

> determiners = c("the", "an", "a")
> determiners
[1] "the" "an"  "a"
> abbreviations = c("1st", "2nd", "3rd")
> abbreviations
[1] "1st" "2nd" "3rd"
> c(determiners, abbreviations)
[1] "the" "an"  "a"   "1st" "2nd" "3rd"
> rep(determiners, 3:1)
[1] "the" "the" "the" "an"  "an"  "a"

When presented with a vector, it is often necessary to access specific elements from that vector. This is done by means of a mechanism called subscripting. The position of the element to be extracted from the vector is added after the vector name between square brackets. When more than one element needs to be extracted, a vector of positions can be used instead of a single number. Here are some examples that illustrate the key principles.

> determiners[1]
[1] "the"
> determiners[2]
[1] "an"

> determiners[c(1, 3)]
[1] "the" "a"
> determiners[3:1]
[1] "a"   "an"  "the"

It is also possible to subscript with a condition that has to be met:

> words = c("the", "cat", "sat", "on", "the", "mat")
> words[words == "the"]          # show the elements equal to "the"
[1] "the" "the"
> which(words == "the")          # show the positions of these elements
[1] 1 5
> words[which(words == "the")]
[1] "the" "the"
> words == "the"                 # a vector of booleans
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
> booleans = words == "the"
> words[booleans]
[1] "the" "the"

When you subscript a vector with a boolean vector (which should be of the same length), the result is a vector with those elements that correspond to the elements with the value TRUE in the boolean vector. The function which() is available for extracting the indexes of those elements in a vector that meet a given condition. Note that a double equal sign, ==, denotes equality, while a single equal sign is the assignment operator.

The function length() returns the number of elements in a vector, so if you need to access the last element in a vector, or the one but last, you can proceed as follows:

> words[length(words)]
[1] "mat"
> words[length(words) - 1]
[1] "the"
> words[(length(words) - 2) : (length(words) - 1)]
[1] "on"  "the"

Note that the left and right arguments of the : operator are enclosed in parentheses. This is necessary because this operator has high precedence, and comes into effect before the minus operator. Compare:

> (length(words) - 2) : (length(words) - 1)
[1] 4 5
> length(words) - 2 : length(words) - 1
[1]  3  2  1  0 -1

In the second case, the vector

> 2 : length(words)
[1] 2 3 4 5 6

is created first. This vector is then subtracted from length(words), which is automatically expanded to a vector of five sixes. From the resulting vector,

> length(words) - 2 : length(words)
[1] 4 3 2 1 0

a one is subtracted to give the final result.

When an operation is carried out on two vectors that do not have the same length, the shorter one is recycled until it has the same length as the longer vector:

> v1 = c(1, 2, 3, 4)
> v2 = c(5, 6)
> v1 * v2
[1]  5 12 15 24

If you want the elements of your vector to be sorted, use sort(). To reverse the order of the elements, use rev(). For a random reordering of the elements of a vector, a permutation, there is sample():

> sort(words)
[1] "cat" "mat" "on"  "sat" "the" "the"
> sort(words)[length(words):1]
[1] "the" "the" "sat" "on"  "mat" "cat"
> rev(sort(words))
[1] "the" "the" "sat" "on"  "mat" "cat"
> numbers = c(1, 3, 5, 7, 11, 13, 17, 19)
> sample(numbers)       # a random permutation of numbers
[1] ...
> numbers = sample(numbers)
> numbers
[1] ...
> sort(numbers)
[1]  1  3  5  7 11 13 17 19

The function unique() removes repeated entries in a vector:

> unique(words)
[1] "the" "cat" "sat" "on"  "mat"
> z = rep(numbers, 1:8)
> z
 [1] ...
[19] ...
> unique(z)

[1] ...
> sort(unique(z))
[1]  1  3  5  7 11 13 17 19

Note that when we ask R to show the contents of the variable z, we get two lines of output. The first line is preceded by [1], indicating that the first number listed on this line is the first element of z. The second line is preceded by [19], which tells you that the next element of z is the 19th element of this vector.

Finally, the table() function tabulates the frequencies with which the items of a vector occur:

> table(words)
words
cat mat  on sat the 
  1   1   1   1   2 
> table(z)
z
...
> z.table = table(z)
> z.table
z
...

The output of the table() function is displayed on three successive lines. First of all, it lists the name of the object for which a table is calculated. In the above examples, these objects were the vectors words and z. On the next line, we find the elements of these vectors, and on the third line, the counts of how often these elements occurred in these vectors.

1.1.3 Objects

The S language on which R is based is an object-oriented language. Everything that exists in R is an object that has specific methods associated with it. There are different types of objects, and each type of object has its own methods. One object type that we encountered above is the vector. One of the methods associated with a vector is the print method. For vectors, the print method is simple: it displays the contents of the vector, adding the position in the vector of the first element on each new line in the R window.

Functions are also objects. Consider, for instance, the function ls(), which lists the contents of your current workspace:

> ls()
[1] "abbreviations" "determiners"   "numbers"       "words"
[5] "x"             "y"             "z"             "z.table"

As ls() is itself an object, it has a print method. For functions, the print method displays the function code. Therefore, if you type ls to the prompt without the parentheses, you ask R to print the object on the screen. The code for ls() is too long to repeat here; instead, we show the code for the command to quit R, q():

> q
function (save = "default", status = 0, runLast = TRUE)
.Internal(quit(save, status, runLast))
<environment: namespace:base>
> q()
Save workspace image? [y/n/c]:

If you want to know more details about a function, the on-line help is very useful: just type a question mark followed by the function name. Type ?q to the prompt in order to see the details of what you can do with q(). When invoked as a function, q() will ask whether the workspace image should be saved. If you respond with yes, the objects in your workspace will be available the next time you start up R with that workspace.

Another type of object is produced by the table() function. Let's have a closer look at the table for our words vector. The function names() extracts the names of the elements that have been counted, and the counts themselves, without their names, can be extracted with the function as.numeric(). And you can access the count associated with a name by subscripting with the relevant name.

> words.table = table(words)
> words.table
words
cat mat  on sat the 
  1   1   1   1   2 
> names(words.table)
[1] "cat" "mat" "on"  "sat" "the"
> as.numeric(words.table)
[1] 1 1 1 1 2
> words.table["the"]
the 
  2 
> words.table[c("the", "cat")]
words
the cat 
  2   1 

It is crucial to keep in mind that subscripting by name always requires double quotes: the name is a string, and should be marked as such. Make sure you understand the following examples.

> z.table
z
...

> z.table[5]
11 
 5 
> z.table["11"]
11 
 5 
> z.table["5"]
5 
4 

If you subscript with the number 5, you ask for the fifth element, which has the label 11 and the value 5. It is often much clearer to address the table by the name itself, in which case you need double quotes. So with z.table["11"] you ask for the count of elevens in the vector.

Up till now, we have created vectors within R. In order to load already existing data files with numbers or strings into R, there is the function scan():

> n = scan(file = "DATA/numbers.txt")
Read 12 items
> w = scan(file = "DATA/words.txt", what = "character")
Read 8 items
> n
 [1] ...
> w
[1] "an"      "example" "of"      "a"       "file"
[6] "with"    "some"    "words"

Here is a summary of the functions and operators discussed thus far. Make sure you are confident about what they do before you read on.

arithmetic                    +  -  *  /  ^  sqrt()
creating vectors              :  c()  seq()  rep()
loading vectors               scan()
reordering vector elements    rev()  sort()  sample()
summarizing vectors           length()  unique()  table()
vector indexes                which()
table objects                 names(), as.numeric()
general                       q(), ls()
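As a brief illustration of how these building blocks combine, the following lines (a small sketch, using only functions from this summary and the words.table object created above) extract the most frequently occurring element of the words vector:

> sort(words.table)[length(words.table)]           # the largest count, with its name
the 
  2 
> names(sort(words.table))[length(words.table)]    # just the name
[1] "the"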

1.1.4 Matrices

Just as it is often quite handy to bring numbers together in a vector, it is often the case that we will need to bring vectors together into tables. The functions cbind() and rbind() bind vectors by column and by row respectively:

> a = c(1, 2, 3, 4)
> b = c(5, 6, 7, 8)
> c = c(9, 10, 11, 12)
> A = cbind(a, b, c)
> A
     a b  c
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> B = rbind(a, b, c)
> B
  [,1] [,2] [,3] [,4]
a    1    2    3    4
b    5    6    7    8
c    9   10   11   12

Tables of numbers such as A and B are referred to as matrices. Note that when you use cbind(), the names of the vectors appear as column labels, while for rbind(), they appear as row labels.

As with vectors, we will often need to access specific elements, or rows, or columns of a matrix. To do so, we use the same subscripting mechanism with square brackets, but we now separate rows and columns by means of a comma. Information preceding the comma pertains to rows, information following the comma concerns columns. Have a look at these examples of subscripting:

> A[1, ]         # select the first row
a b c 
1 5 9 
> A[2, ]         # select the second row
 a  b  c 
 2  6 10 
> A[, "c"]       # select the column labelled "c"
[1]  9 10 11 12
> A[, 3]         # select the third column
[1]  9 10 11 12
> B["b", ]       # select the row labelled "b"
[1] 5 6 7 8

> B[, 2]         # select the second column
 a  b  c 
 2  6 10 
> C = B[1:2, 4:2]
> C
  [,1] [,2] [,3]
a    4    3    2
b    8    7    6

Note that the matrix C inherited the row labels from matrix B. The dimensions of a matrix can be queried with the function dim():

> dim(C)
[1] 2 3

What dim() returns is a vector with two elements specifying the number of rows and the number of columns.

When we created the matrices A and B, we did so by first creating individual vectors, which we then combined. When you have a vector that you want to reformat into a matrix, you can use the matrix() function:

> n                        # we made this vector above
 [1] ...
> D = matrix(n, 6, 2)      # a matrix of 6 rows and 2 columns
> D
     [,1] [,2]
[1,]    2   14
[2,]    4   16
[3,]    6   17
[4,]    8   18
[5,]  ...  ...
[6,]  ...  ...
> E = matrix(n, 4, 3)      # a matrix of 4 rows and 3 columns
> E
     [,1] [,2] [,3]
 ...
> F = matrix(0, 2, 2)      # a 2 by 2 matrix of zeros
> F
     [,1] [,2]
[1,]    0    0
[2,]    0    0
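The boolean subscripting that we introduced for vectors can also be used to select rows or columns of a matrix. As a brief aside, building on the matrix A created above, the following selects the rows for which the value in column a exceeds 2:

> A[A[, "a"] > 2, ]
     a b  c
[1,] 3 7 11
[2,] 4 8 12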

You can add or subtract matrices that have the same dimensions. If you multiply a matrix by a number, each element of the matrix is multiplied by that number. The same principle applies to arithmetic functions such as sqrt() or log():

> E * 2
     [,1] [,2] [,3]
 ...
> E + 2 * E
     [,1] [,2] [,3]
 ...
> log(E + 1)               # natural logarithm
     [,1] [,2] [,3]
 ...

Vectors of strings can also be combined into tables, but these are referred to as arrays and not as matrices:

> X = rbind(c("this", "is"), c("an", "array"), c("and", "not"),
+   c("a", "matrix"))
> X
     [,1]   [,2]
[1,] "this" "is"
[2,] "an"   "array"
[3,] "and"  "not"
[4,] "a"    "matrix"
> X[3, 1]                  # extract the 1st element on the 3rd row
[1] "and"

The plus sign on the second line of this example is the prompt that R gives instead of > when the command on the previous line is not complete.

1.1.5 Data frames

The elements of an array and a matrix should all be of the same type. When you try to combine vectors of different types, one of the vectors will be converted to the type of the other, as in the following example:

> words.table              # we made this table above
words
cat mat  on sat the 
  1   1   1   1   2 
> wrong = cbind(names(words.table), as.numeric(words.table))
> wrong
     [,1]  [,2]
[1,] "cat" "1" 
[2,] "mat" "1" 
[3,] "on"  "1" 
[4,] "sat" "1" 
[5,] "the" "2" 
> rm(wrong)                # delete wrong from the workspace

The second column of wrong is not a vector of numbers, but a vector of strings, so we cannot do any numerical operations on this vector any more. Fortunately, R provides a special kind of table in which you can bring together vectors of different types. This data type is known as a data frame. Here is how you can create a data frame to replace wrong:

> right = data.frame(words = names(words.table),
+   frequency = as.numeric(words.table))
> right
  words frequency
1   cat         1
2   mat         1
3    on         1
4   sat         1
5   the         2

We supplied two vectors to the data.frame() function, which we named words and frequency. These names appear as the column labels of the data frame right. There are three ways in which you can access the columns of a data frame, and two ways to access its rows.

> right[, 2]               # the second column
[1] 1 1 1 1 2
> right[, "frequency"]     # the column labelled "frequency"
[1] 1 1 1 1 2
> right$frequency          # the $ operator saves typing
[1] 1 1 1 1 2
> right["1", ]             # the row with the rowname "1"
  words frequency
1   cat         1
> right[1, ]               # the first row
  words frequency
1   cat         1
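Row and column subscripts can also be combined to pick out a single cell; a brief aside:

> right[3, "frequency"]    # the cell in row 3 of the column "frequency"
[1] 1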

New is the $ operator, which provides a convenient way of addressing columns. Data frames have both row names and column names, which can be extracted with the functions rownames() and colnames():

> rownames(right)
[1] "1" "2" "3" "4" "5"
> colnames(right)
[1] "words"     "frequency"

The outputs of rownames() and colnames() are themselves vectors, and the lengths of these vectors are identical to the dimensions returned by dim():

> length(rownames(right)) == dim(right)[1]
[1] TRUE
> length(colnames(right)) == dim(right)[2]
[1] TRUE

In order to illustrate the use of data frames, let's consider a real example, a data set studied by Baayen and Hay [2004]. Baayen and Hay were interested in the extent to which nonlinguistic cognition is affected by one's language. More specifically, they investigated to what extent one's knowledge of the names for objects and the lexical properties of these names influence the way we think about these objects. They addressed this question by means of an experiment with 81 concrete words, names for animals as well as fruits, nuts and vegetables. They asked 20 subjects to indicate, for each of these words, on a seven-point scale how heavy they thought the word's referent was. The result is a data set with 20 * 81 = 1620 subjective weight estimates.

The experimental data are available in the DATA directory as weight.ratings.txt. This file has 1620 lines, one for each rating elicited for a given subject and a given word. A given line also lists the sex of the subject, as well as many different lexical variables. We load this data file into R with read.table():

> weight = read.table("DATA/weight.ratings.txt", header = TRUE)

The option header = TRUE specifies that the first line in weight.ratings.txt is a header that specifies the names for the different columns. This option can be abbreviated to T. It makes no sense to type weight to the R prompt, as weight is a very large object, as can be seen with dim():

> dim(weight)
[1] 1620   23

It is much more informative to inspect, say, the first four lines,

> weight[1:4, ]
  Subject Rating Trial Sex     Word Frequency FamilySize
1      A1      5     1   F    horse       ...        ...
2      A1      1     2   F  gherkin       ...        ...
3      A1      3     3   F hedgehog       ...        ...
4      A1      1     4   F      bee       ...        ...
  SynsetCount Length  Class FreqSingular FreqPlural
1         ...      5 animal          ...        ...
2         ...      7  plant          ...        ...
3         ...      8 animal          ...        ...
4         ...      3 animal          ...        ...
  DerivEntropy Complex rinfl meanrt SubjFreq meansize
1          ... simplex   ...    ...      ...      ...
2          ... simplex   ...    ...      ...      ...
3          ... simplex   ...    ...      ...      ...
4          ... simplex   ...    ...      ...      ...
  BNCw BNCc BNCd BNCcRatio BNCdRatio
1  ...  ...  ...       ...       ...
2  ...  ...  ...       ...       ...
3  ...  ...  ...       ...       ...
4  ...  ...  ...       ...       ...

or to ask for the column names:

> colnames(weight)
 [1] "Subject"      "Rating"       "Trial"
 [4] "Sex"          "Word"         "Frequency"
 [7] "FamilySize"   "SynsetCount"  "Length"
[10] "Class"        "FreqSingular" "FreqPlural"
[13] "DerivEntropy" "Complex"      "rinfl"
[16] "meanrt"       "SubjFreq"     "meansize"
[19] "BNCw"         "BNCc"         "BNCd"
[22] "BNCcRatio"    "BNCdRatio"

The first column lists the (anonymized) subjects in the experiment, each of whom contributed 81 ratings:

> table(weight$Subject)

A1 A2  G  H I1 I2  J  K  L M1 M2  P R1 R2 R3 R4 R5 S1 S2 T1 
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 

In order to obtain just the names of the different subjects, we have two options:

> unique(weight$Subject)

 [1] A1 A2 G  H  I1 I2 J  K  L  M1 M2 P  R1 R2 R3 R4 R5 S1
[19] S2 T1
20 Levels: A1 A2 G H I1 I2 J K L M1 M2 P R1 R2 R3 R4 ... T1
> levels(weight$Subject)
 [1] "A1" "A2" "G"  "H"  "I1" "I2" "J"  "K"  "L"  "M1" "M2"
[12] "P"  "R1" "R2" "R3" "R4" "R5" "S1" "S2" "T1"

Note that when we apply unique() to the column of the data frame listing the subjects, we not only obtain the names of the subjects, but also some additional information, namely, that there are twenty levels, which are subsequently listed in summary form. The reason for this additional information is that string vectors in a data frame are converted automatically to factors. A factor in statistics is a variable that has strings as its possible values. Another factor in the weight data frame is Sex, which has as its levels F (female) and M (male). The distinction between a string vector and a factor is crucial for many of the tools that we will be using in later chapters. The simplest way to see what subjects participated is to use the function that simply returns the levels of a factor, levels(). A related summary function is nlevels(), which returns the number of levels:

> nlevels(weight$Subject)
[1] 20

The column labelled Word specifies the word for which a rating was elicited. We extract a list of these words with levels():

> levels(weight$Word)
 [1] "almond"     "ant"        "apple"      "apricot"
 [5] "asparagus"  "avocado"    "badger"     "banana"
 [9] "bat"        "beaver"     "bee"        "beetroot"
[13] "blackberry" "blueberry"  "broccoli"   "bunny"
[17] "butterfly"  "camel"      "carrot"     "cat"
[21] "cherry"     "chicken"    "clove"      "crocodile"
[25] "cucumber"   "dog"        "dolphin"    "donkey"
[29] "eagle"      "eggplant"   "elephant"   "fox"
[33] "frog"       "gherkin"    "goat"       "goose"
[37] "grape"      "gull"       "hedgehog"   "horse"
[41] "kiwi"       "leek"       "lemon"      "lettuce"
[45] "lion"       "magpie"     "melon"      "mole"
[49] "monkey"     "moose"      "mouse"      "mushroom"
[53] "mustard"    "olive"      "orange"     "owl"
[57] "paprika"    "peanut"     "pear"       "pig"
[61] "pigeon"     "pineapple"  "potato"     "radish"
[65] "reindeer"   "shark"      "sheep"      "snake"
[69] "spider"     "squid"      "squirrel"   "stork"
[73] "strawberry" "swan"       "tomato"     "tortoise"

21 [77] "vulture" "walnut" "wasp" "whale" [81] "woodpecker" The third column of weight lists the trial number, if a word has trial number 4, it was the fourth word in the experiment that a given subject was asked to rate. This variable ranges from 1 to 81, and is a control variable that allows us to trace possible effects of learing or fatigue that might take place in the course of the experiment. The remaining 18 columns specify various properties of the words. These are briefly described in the appendix. Variables of interest for the present example are a word s frequency (Frequency), its family size (the number of complex words in which it appears as a constituent, FamilySize), the number of synonym sets (synsets) in which it is listed in WordNet [Miller, 1990, Beckwith et al., 1991, Fellbaum, 1998], its length in letters (Length), its Class (plant or animal), and its derivational entropy, a token-weighted variant of the family size count, [Moscoso del Prado Martín et al., 2004]. All lexical variables for a given word are repeated on twenty rows of weight, once for each subject. In order to obtain a data frame that is restricted to the information pertaining only to the items, we use the unique() function as follows: > items = unique(weight[, 5:23]) # skip columns with information > dim(items) # specific to subject and trial [1] The first four columns of weight contain information about the subject and the trial itself. This is information we want to discard in order to obtain a data frame that summarizes the properties of the words in the experiment. From column 5 onwards, we have the information that is specific to the items. The subsection of the data frame obtained by weight[, 5:23] has 81 unique lines, each of which is repeated 20 times, once for each subject. With unique, we remove the redundant lines, and retain exactly one instance of each unique line. This data frame still has more columns then we need at this moment. Let s consider how we can create a smaller data frame with exactly the relevant information: > items = items[,c(1:6,9)] # or > items = items[,c("word", "Frequency", "FamilySize", + "SynsetCount", "Length", "Class", "DerivEntropy")] > items[1:4,] Word Frequency FamilySize SynsetCount Length Class 1 horse animal 2 gherkin plant 3 hedgehog animal 4 bee animal DerivEntropy

The first two commands select exactly the same subset of columns. The first uses the column numbers, the second the column names.

It is sometimes convenient to relabel column names or row names. For instance, we can replace the row names by the words,

> rownames(items) = as.character(items$Word)
> items[1:4, 1:6]
             Word Frequency FamilySize SynsetCount Length  Class
horse       horse       ...        ...         ...      5 animal
gherkin   gherkin       ...        ...         ...      7  plant
hedgehog hedgehog       ...        ...         ...      8 animal
bee           bee       ...        ...         ...      3 animal

which makes it easy to extract information from the data frame by name:

> items["bat", "Frequency"]
[1] ...
> items["pig", "Length"]
[1] 3

The function as.character() used above converts the factor items$Word to a vector of strings. This conversion is necessary because the row names and the column names are vectors of strings and not factors.

It is very important to become fluent in subscripting data frames. You should keep in mind that restrictions on rows precede the comma, and restrictions on columns follow the comma in the subscripting sequence [ , ]. Here are some examples:

> items[items$Length == 3, 2:4]
    Frequency FamilySize SynsetCount
bee       ...        ...         ...
pig       ...        ...         ...
fox       ...        ...         ...
bat       ...        ...         ...
dog       ...        ...         ...
owl       ...        ...         ...
cat       ...        ...         ...
ant       ...        ...         ...
> items2 = items[items$Length > 5 & items$Length < 7, c(1, 3)][1:5, ]
> items2
         Word FamilySize
peanut peanut        ...
pigeon pigeon        ...
tomato tomato        ...

donkey donkey        ...
magpie magpie        ...
> items[items$Length > 7 | items$Length == 3,
+   c("Word", "FamilySize")][1:5, ]
                 Word FamilySize
hedgehog     hedgehog        ...
bee               bee        ...
pineapple   pineapple        ...
blackberry blackberry        ...
tortoise     tortoise        ...

The second and third examples illustrate the logical connectives and (&) and or (|). They also illustrate that you can subscript the part of the data frame that you have just subscripted. After all, the result of subscripting a data frame is a new, smaller data frame, which can in turn be subscripted. Subscripting with [1:5, ] is a convenient way of inspecting the first couple of lines of a data frame.

You sort the lines of a data frame with the function order():

> items2[order(items2$Word), ]
         Word FamilySize
donkey donkey        ...
magpie magpie        ...
peanut peanut        ...
pigeon pigeon        ...
tomato tomato        ...

When you call order() with items2$Word as argument, it returns a vector with the row numbers of the words such that the words themselves are sorted:

> order(items2$Word)
[1] 4 5 1 2 3

When this vector is inserted in the row slot of the subscript of items2, its rows are rearranged in this order. When order() is supplied with more than one argument, it will sort on the first argument, and resolve ties by looking at the second argument, or the third, if required, and so on.

> items2[order(items2$FamilySize, items2$Word), ]
         Word FamilySize
donkey donkey        ...
magpie magpie        ...
tomato tomato        ...
peanut peanut        ...
pigeon pigeon        ...
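To sort in descending order, you can reverse the ordering vector with rev(), which we encountered earlier; a brief aside:

> rev(order(items2$Word))
[1] 3 2 1 5 4
> items2[rev(order(items2$Word)), ]   # rows from tomato down to donkey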

1.1.6 Random variables

Thus far, we have encountered two kinds of vectors: numerical vectors and factors. Both are used to represent random variables. A random variable is the outcome of an experiment. Here are some examples of experiments and their associated random variables:

tossing a coin: a random variable with values head or tail.

throwing a die: a random variable with values 1, 2, ..., 6.

counting words: a random variable with as its values the frequency of occurrence in some corpus: 0, 1, 2, ..., N (with N the size of the corpus).

familiarity rating: the subjective estimate of frequency, usually on a scale of 1 to 7, is the random variable in this experiment.

lexical decision: this kind of experiment has two associated random variables: the accuracy of a response (with levels correct and incorrect) and the latency of the response (in milliseconds).

A random variable is random in the sense that the outcome of a given experiment is not fully predictable. The opposite of a random variable is a constant. The size of a given corpus such as the Brown corpus [Kučera and Francis, 1967] in word tokens is fixed; hence by itself this specific corpus size is not a random variable. On the other hand, corpus size, defined as ranging over many different corpora, is a random variable, because we cannot say what the corpus size is without being told what corpus we are dealing with. The art of statistics is to learn from prior experience with a given random variable (or sets of random variables) in order to optimize one's predictions as to what the most likely value of a random variable is.
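Random variables of this kind are easy to simulate in R. As a small sketch using the sample() function introduced earlier (the extra argument replace = TRUE allows the same outcome to be drawn repeatedly; the results will differ from run to run):

> tosses = sample(c("head", "tail"), 20, replace = TRUE)
> tosses          # twenty simulated coin tosses
> table(tosses)   # how often each outcome occurred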

1.1.7 Summary

Before starting with the section on visualization, make sure you are confident about the use of the functions that were introduced in this section.

creating tables      cbind()  rbind()  matrix()
data frames          $  data.frame()  rownames()  colnames()  read.table()
properties           dim()
sorting              order()
selecting            unique()
type conversion      as.character()
general              rm()

1.2 Visualization

An important first step in exploratory data analysis is to inspect your data graphically. It is difficult and often downright impossible to make sense of large tables of numbers. But patterns in the data often become visible thanks to the tools for data visualization that are now available. We first discuss tools for visualizing properties of single random variables (in vectors and one-dimensional tables), and then proceed with an overview of tools for graphing groups of random variables (typically brought together in matrices or data frames).

1.2.1 Visualizing single random variables

Bar plots and histograms are useful for obtaining visual summaries of the distributions of random variables. Figure 1.1 illustrates this for the numeric variables in the items data frame that describes the main properties of the words used in the rating experiment eliciting subjective estimates of the referent's weight. This figure has six panels arranged in a matrix of three rows and two columns. In order to instruct R to make such a matrix of plots, we have to set the appropriate graphics parameter, mfrow, to the vector c(3, 2) using the function par(), which controls a large number of graphical parameters. Plots will be added one by one to the plot region, proceeding row by row from left to right.

par(mfrow = c(3, 2))

The upper left panel is a bar plot of the counts of word lengths:

barplot(table(items$Length), xlab = "word length", col = "grey")

The option xlab sets the label for the X axis, and with the option col we set the color of the bars to grey. We see that word lengths range from 3 to 10, and that the distribution is somewhat asymmetric, with a mode (the value observed most often) at 5. The mean is 5.9, and the median is 6. (The median is obtained by ordering the observations from small to large, and then taking the value for which 50% of the data points are smaller.) Mean, median, and range are easy to extract with the corresponding functions mean(), median(), and range():

> mean(items$Length)
[1] 5.9
> median(items$Length)
[1] 6
> range(items$Length)
[1]  3 10

There are also separate functions for extracting the minimum and the maximum:

> min(items$Length)
[1] 3
> max(items$Length)
[1] 10

The upper right panel of Figure 1.1 shows the histogram corresponding to the bar plot in the upper left panel. The main difference between the bar plot and the histogram is that the latter is scaled on the vertical axis in such a way that the total area of the bars is equal to 1. This allows us to see that the probability of the word lengths 5 and 6 jointly is close to 0.5. This histogram was produced with the truehist() function in the MASS library of Venables and Ripley [2003], which is part of any recent distribution of R. In order to access this function, we need to load this library with the library() function,

library(MASS)

after which we can produce the histogram in the upper right panel with

truehist(items$Length, xlab = "word length", col = "grey")

The remaining panels of Figure 1.1 were made in the same way:

truehist(items$Frequency, xlab = "log word frequency", col = "grey")
truehist(items$SynsetCount, xlab = "log synset count", col = "grey")
truehist(items$FamilySize, xlab = "log family size", col = "grey")
truehist(items$DerivEntropy, xlab = "derivational entropy", col = "grey")

Note that the bottom panels show highly skewed distributions: most of the words in this experiment have no morphological family members at all. Now that all panels have been filled, we reset the graphics parameter to one figure in the plot region:

par(mfrow = c(1, 1))

There are several ways in which plots can be saved as independent graphics files: as png or jpeg files, or as PostScript files. The corresponding functions are png(), jpeg(), and postscript(). We illustrate how these functions work for PostScript.

> postscript("barplot.ps", horizontal = FALSE, he = 6, wi = 6,
+   family = "Helvetica", paper = "special", onefile = FALSE)
> truehist(items$Frequency, xlab = "log word frequency")
> dev.off()

The first argument of postscript() is the name of the PostScript file to be created. Whether the plot should be in portrait or landscape mode is controlled by the horizontal argument. The parameters he and wi control the height and width of the plot in inches. The font to be used is specified by family, and with paper = "special" the output will be an encapsulated PostScript file that can easily be incorporated in, for instance, a LaTeX document.

Figure 1.1: A bar plot and histograms for the variables describing the lexical properties of the words used in the weight rating experiment. (Panels, from top left to bottom right: word length, word length, log word frequency, log synset count, log family size, derivational entropy.)

The final argument, onefile, is set to FALSE in order to indicate that there is only a single plot in the file. There are many more options; check the on-line help for further details.

The postscript() command opens a PostScript file, and all following plot commands are diverted to this PostScript file. To close the PostScript file, you use the function dev.off(). After this command, new plots will appear in the graphics window as usual. The dev.off() command is crucial: if you forget to close your file, you will run into all sorts of trouble when you try to view the file outside R, or when you try to make a new figure in R.

The shape of a histogram depends, sometimes to a surprising extent, on the width of the bars and on the position of the left side of the first bar. The function truehist() has defaults that are chosen to minimize the risk of obtaining a rather arbitrarily shaped histogram (see also Haerdle, 1991). A function that further reduces this risk is density(). We illustrate this function for the reaction times elicited in a visual lexical decision experiment using the same words as in the weight rating experiment. In a visual lexical decision experiment, words are presented on a computer screen together with non-existing words like sulp. Subjects are asked to indicate as quickly as possible, by means of two push buttons, whether the letter string presented on the screen is a real word. The time between the moment that the word is displayed on the screen and the moment at which a button response is recorded is the reaction time (also referred to as the response latency). It is a measure of the complexity of lexical processing, which is known to be co-determined by a wide range of lexical variables. The reaction times for 79 of the 81 words discussed above are available in the DATA directory as the text file lexdec.txt. For further details about the variables in this data set, see the appendix.

The upper left panel of Figure 1.2 shows the histogram as given by truehist() applied to the (logarithmically transformed) reaction times:

lexdec = read.table("DATA/lexdec.txt", T)
truehist(lexdec$RT, col = "lightgrey", xlab = "log RT")

The distribution of reaction times is somewhat skewed, with an extended right tail of very long latencies. The upper right panel of Figure 1.2 shows the histogram as produced by hist() instead of truehist(), together with the density curve. The two have roughly the same shape, but the density curve smoothes the discrete jumps of the histogram. The lower left panel uses hist(), but now with the same bin widths as truehist(). The histogram and the density curve are now very similar estimates of the distribution of reaction times.

Plotting the upper right and lower left panels requires some careful preparation in order to make sure that the ranges of values for the two axes are set properly to accommodate both the histogram and the density function. We therefore begin with the standard function for making a histogram, hist(), which we force to make the same bins as truehist() in the case of the lower left panel, by specifying the breaks (the points where new bins should begin) explicitly. Instead of plotting the histogram, we save it, so that we can extract the range of values for the horizontal and vertical axes.

> h = hist(lexdec$RT, freq = FALSE, plot = FALSE,

+   breaks = seq(5.8, 7.6, by = 0.1))      # lower left panel

Figure 1.2: Histograms and density function for the response latencies of 21 subjects to 79 nouns referring to animals and plants (fruits and vegetables). (All panels share the axis label log RT.)

We then repeat this procedure for the density curve,

> d = density(lexdec$RT)

and then set the X and Y limits:

> xlimit = range(h$breaks, range(d$x))
> ylimit = range(0, h$density, d$y)

Finally, we plot the histogram, and add the curve for the density with the function lines(). The function lines() takes a vector of x coordinates and a vector of y coordinates, and connects the points specified by these coordinates with a line (in the order specified by the input vectors).

> hist(lexdec$RT, freq = FALSE, col = "lightgrey",
+   border = "darkgrey", ylab = "", xlab = "log RT",
+   xlim = xlimit, ylim = ylimit, main = "",
+   breaks = seq(5.8, 7.6, by = 0.1))
> lines(d)

The border option of hist() controls the color of the lines marking the bars of the histogram. We prevent hist() from adding a main plot title by setting main to the empty string.
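If you want to save the resulting figure to a file, the same device mechanism applies as described above for PostScript. A minimal sketch using the png() device (the file name hist.png is only an example):

png("hist.png", width = 480, height = 480)   # open a PNG file
hist(lexdec$RT, freq = FALSE, col = "lightgrey",
     border = "darkgrey", ylab = "", xlab = "log RT",
     xlim = xlimit, ylim = ylimit, main = "",
     breaks = seq(5.8, 7.6, by = 0.1))
lines(d)                                     # add the density curve
dev.off()                                    # close the file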

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize

More information

Getting Started with R and RStudio 1

Getting Started with R and RStudio 1 Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following

More information

Drawing a histogram using Excel

Drawing a histogram using Excel Drawing a histogram using Excel STEP 1: Examine the data to decide how many class intervals you need and what the class boundaries should be. (In an assignment you may be told what class boundaries to

More information

Excel 2003 Tutorial I

Excel 2003 Tutorial I This tutorial was adapted from a tutorial by see its complete version at http://www.fgcu.edu/support/office2000/excel/index.html Excel 2003 Tutorial I Spreadsheet Basics Screen Layout Title bar Menu bar

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

Participant Guide RP301: Ad Hoc Business Intelligence Reporting

Participant Guide RP301: Ad Hoc Business Intelligence Reporting RP301: Ad Hoc Business Intelligence Reporting State of Kansas As of April 28, 2010 Final TABLE OF CONTENTS Course Overview... 4 Course Objectives... 4 Agenda... 4 Lesson 1: Reviewing the Data Warehouse...

More information

Ohio University Computer Services Center August, 2002 Crystal Reports Introduction Quick Reference Guide

Ohio University Computer Services Center August, 2002 Crystal Reports Introduction Quick Reference Guide Open Crystal Reports From the Windows Start menu choose Programs and then Crystal Reports. Creating a Blank Report Ohio University Computer Services Center August, 2002 Crystal Reports Introduction Quick

More information

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc. STATGRAPHICS Online Statistical Analysis and Data Visualization System Revised 6/21/2012 Copyright 2012 by StatPoint Technologies, Inc. All rights reserved. Table of Contents Introduction... 1 Chapter

More information

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-110012 1. INTRODUCTION R is a free software environment for statistical computing

More information

SPSS: Getting Started. For Windows

SPSS: Getting Started. For Windows For Windows Updated: August 2012 Table of Contents Section 1: Overview... 3 1.1 Introduction to SPSS Tutorials... 3 1.2 Introduction to SPSS... 3 1.3 Overview of SPSS for Windows... 3 Section 2: Entering

More information

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS SECTION 2-1: OVERVIEW Chapter 2 Describing, Exploring and Comparing Data 19 In this chapter, we will use the capabilities of Excel to help us look more carefully at sets of data. We can do this by re-organizing

More information

Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller

Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller Most of you want to use R to analyze data. However, while R does have a data editor, other programs such as excel are often better

More information

Microsoft Excel Tips & Tricks

Microsoft Excel Tips & Tricks Microsoft Excel Tips & Tricks Collaborative Programs Research & Evaluation TABLE OF CONTENTS Introduction page 2 Useful Functions page 2 Getting Started with Formulas page 2 Nested Formulas page 3 Copying

More information

Microsoft Access Basics

Microsoft Access Basics Microsoft Access Basics 2006 ipic Development Group, LLC Authored by James D Ballotti Microsoft, Access, Excel, Word, and Office are registered trademarks of the Microsoft Corporation Version 1 - Revision

More information

Creating Interactive PDF Forms

Creating Interactive PDF Forms Creating Interactive PDF Forms Using Adobe Acrobat X Pro Information Technology Services Outreach and Distance Learning Technologies Copyright 2012 KSU Department of Information Technology Services This

More information

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P. SQL databases An introduction AMP: Apache, mysql, PHP This installations installs the Apache webserver, the PHP scripting language, and the mysql database on your computer: Apache: runs in the background

More information

Data exploration with Microsoft Excel: univariate analysis

Data exploration with Microsoft Excel: univariate analysis Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Microsoft Excel 2010 Part 3: Advanced Excel

Microsoft Excel 2010 Part 3: Advanced Excel CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES Microsoft Excel 2010 Part 3: Advanced Excel Winter 2015, Version 1.0 Table of Contents Introduction...2 Sorting Data...2 Sorting

More information

0 Introduction to Data Analysis Using an Excel Spreadsheet

0 Introduction to Data Analysis Using an Excel Spreadsheet Experiment 0 Introduction to Data Analysis Using an Excel Spreadsheet I. Purpose The purpose of this introductory lab is to teach you a few basic things about how to use an EXCEL 2010 spreadsheet to do

More information

Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors. Content of the Precalculus Subpackage

Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors. Content of the Precalculus Subpackage Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors Robert J. Lopez Emeritus Professor of Mathematics and Maple Fellow Maplesoft This article provides a systematic exposition

More information

Figure 1. An embedded chart on a worksheet.

Figure 1. An embedded chart on a worksheet. 8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the

More information

Graphics in R. Biostatistics 615/815

Graphics in R. Biostatistics 615/815 Graphics in R Biostatistics 615/815 Last Lecture Introduction to R Programming Controlling Loops Defining your own functions Today Introduction to Graphics in R Examples of commonly used graphics functions

More information

Exercise 4 Learning Python language fundamentals

Exercise 4 Learning Python language fundamentals Exercise 4 Learning Python language fundamentals Work with numbers Python can be used as a powerful calculator. Practicing math calculations in Python will help you not only perform these tasks, but also

More information

Module 2 Basic Data Management, Graphs, and Log-Files

Module 2 Basic Data Management, Graphs, and Log-Files AGRODEP Stata Training April 2013 Module 2 Basic Data Management, Graphs, and Log-Files Manuel Barron 1 and Pia Basurto 2 1 University of California, Berkeley, Department of Agricultural and Resource Economics

More information

Using R for Windows and Macintosh

Using R for Windows and Macintosh 2010 Using R for Windows and Macintosh R is the most commonly used statistical package among researchers in Statistics. It is freely distributed open source software. For detailed information about downloading

More information

Exploratory Data Analysis and Plotting

Exploratory Data Analysis and Plotting Exploratory Data Analysis and Plotting The purpose of this handout is to introduce you to working with and manipulating data in R, as well as how you can begin to create figures from the ground up. 1 Importing

More information

WHO STEPS Surveillance Support Materials. STEPS Epi Info Training Guide

WHO STEPS Surveillance Support Materials. STEPS Epi Info Training Guide STEPS Epi Info Training Guide Department of Chronic Diseases and Health Promotion World Health Organization 20 Avenue Appia, 1211 Geneva 27, Switzerland For further information: www.who.int/chp/steps WHO

More information

Psy 210 Conference Poster on Sex Differences in Car Accidents 10 Marks

Psy 210 Conference Poster on Sex Differences in Car Accidents 10 Marks Psy 210 Conference Poster on Sex Differences in Car Accidents 10 Marks Overview The purpose of this assignment is to compare the number of car accidents that men and women have. The goal is to determine

More information

Plotting: Customizing the Graph

Plotting: Customizing the Graph Plotting: Customizing the Graph Data Plots: General Tips Making a Data Plot Active Within a graph layer, only one data plot can be active. A data plot must be set active before you can use the Data Selector

More information

Microsoft Access 2010 Overview of Basics

Microsoft Access 2010 Overview of Basics Opening Screen Access 2010 launches with a window allowing you to: create a new database from a template; create a new template from scratch; or open an existing database. Open existing Templates Create

More information

Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions

Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions 2010 Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions This document describes how to perform some basic statistical procedures in Microsoft Excel. Microsoft Excel is spreadsheet

More information

Intro to Excel spreadsheets

Intro to Excel spreadsheets Intro to Excel spreadsheets What are the objectives of this document? The objectives of document are: 1. Familiarize you with what a spreadsheet is, how it works, and what its capabilities are; 2. Using

More information

Microsoft Access 2010 Part 1: Introduction to Access

Microsoft Access 2010 Part 1: Introduction to Access CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES Microsoft Access 2010 Part 1: Introduction to Access Fall 2014, Version 1.2 Table of Contents Introduction...3 Starting Access...3

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Netigate User Guide. Setup... 2. Introduction... 5. Questions... 6. Text box... 7. Text area... 9. Radio buttons...10. Radio buttons Weighted...

Netigate User Guide. Setup... 2. Introduction... 5. Questions... 6. Text box... 7. Text area... 9. Radio buttons...10. Radio buttons Weighted... Netigate User Guide Setup... 2 Introduction... 5 Questions... 6 Text box... 7 Text area... 9 Radio buttons...10 Radio buttons Weighted...12 Check box...13 Drop-down...15 Matrix...17 Matrix Weighted...18

More information

AMATH 352 Lecture 3 MATLAB Tutorial Starting MATLAB Entering Variables

AMATH 352 Lecture 3 MATLAB Tutorial Starting MATLAB Entering Variables AMATH 352 Lecture 3 MATLAB Tutorial MATLAB (short for MATrix LABoratory) is a very useful piece of software for numerical analysis. It provides an environment for computation and the visualization. Learning

More information

DataPA OpenAnalytics End User Training

DataPA OpenAnalytics End User Training DataPA OpenAnalytics End User Training DataPA End User Training Lesson 1 Course Overview DataPA Chapter 1 Course Overview Introduction This course covers the skills required to use DataPA OpenAnalytics

More information

InfiniteInsight 6.5 sp4

InfiniteInsight 6.5 sp4 End User Documentation Document Version: 1.0 2013-11-19 CUSTOMER InfiniteInsight 6.5 sp4 Toolkit User Guide Table of Contents Table of Contents About this Document 3 Common Steps 4 Selecting a Data Set...

More information

R: A self-learn tutorial

R: A self-learn tutorial R: A self-learn tutorial 1 Introduction R is a software language for carrying out complicated (and simple) statistical analyses. It includes routines for data summary and exploration, graphical presentation

More information

Part 1 Foundations of object orientation

Part 1 Foundations of object orientation OFWJ_C01.QXD 2/3/06 2:14 pm Page 1 Part 1 Foundations of object orientation OFWJ_C01.QXD 2/3/06 2:14 pm Page 2 1 OFWJ_C01.QXD 2/3/06 2:14 pm Page 3 CHAPTER 1 Objects and classes Main concepts discussed

More information

Using Excel for Statistical Analysis

Using Excel for Statistical Analysis 2010 Using Excel for Statistical Analysis Microsoft Excel is spreadsheet software that is used to store information in columns and rows, which can then be organized and/or processed. Excel is a powerful

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

8 CREATING FORM WITH FORM WIZARD AND FORM DESIGNER

8 CREATING FORM WITH FORM WIZARD AND FORM DESIGNER 8 CREATING FORM WITH FORM WIZARD AND FORM DESIGNER 8.1 INTRODUCTION Forms are very powerful tool embedded in almost all the Database Management System. It provides the basic means for inputting data for

More information

While Loops and Animations

While Loops and Animations C h a p t e r 6 While Loops and Animations In this chapter, you will learn how to use the following AutoLISP functions to World Class standards: 1. The Advantage of Using While Loops and Animation Code

More information

Introduction to Matlab

Introduction to Matlab Introduction to Matlab Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Email. SSRL@American.edu Course Objective This course provides

More information

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining

More information

STAT10020: Exploratory Data Analysis

STAT10020: Exploratory Data Analysis STAT10020: Exploratory Data Analysis Statistical Programming with R Lab 1 Log in using your student number and password. Click on the start button and then click on Application Window. Choose the Mathematics

More information

Microsoft Access 3: Understanding and Creating Queries

Microsoft Access 3: Understanding and Creating Queries Microsoft Access 3: Understanding and Creating Queries In Access Level 2, we learned how to perform basic data retrievals by using Search & Replace functions and Sort & Filter functions. For more complex

More information

Agenda2. User Manual. Agenda2 User Manual Copyright 2010-2013 Bobsoft 1 of 34

Agenda2. User Manual. Agenda2 User Manual Copyright 2010-2013 Bobsoft 1 of 34 Agenda2 User Manual Agenda2 User Manual Copyright 2010-2013 Bobsoft 1 of 34 Agenda2 User Manual Copyright 2010-2013 Bobsoft 2 of 34 Contents 1. User Interface! 5 2. Quick Start! 6 3. Creating an agenda!

More information

Access Tutorial 8: Combo Box Controls

Access Tutorial 8: Combo Box Controls Access Tutorial 8: Combo Box Controls 8.1 Introduction: What is a combo box? So far, the only kind of control you have used on your forms has been the text box. However, Access provides other controls

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

COGNOS Query Studio Ad Hoc Reporting

COGNOS Query Studio Ad Hoc Reporting COGNOS Query Studio Ad Hoc Reporting Copyright 2008, the California Institute of Technology. All rights reserved. This documentation contains proprietary information of the California Institute of Technology

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

Business Objects Version 5 : Introduction

Business Objects Version 5 : Introduction Business Objects Version 5 : Introduction Page 1 TABLE OF CONTENTS Introduction About Business Objects Changing Your Password Retrieving Pre-Defined Reports Formatting Your Report Using the Slice and Dice

More information

SAS Analyst for Windows Tutorial

SAS Analyst for Windows Tutorial Updated: August 2012 Table of Contents Section 1: Introduction... 3 1.1 About this Document... 3 1.2 Introduction to Version 8 of SAS... 3 Section 2: An Overview of SAS V.8 for Windows... 3 2.1 Navigating

More information

Introduction to the data.table package in R

Introduction to the data.table package in R Introduction to the data.table package in R Revised: September 18, 2015 (A later revision may be available on the homepage) Introduction This vignette is aimed at those who are already familiar with creating

More information

Tips and Tricks SAGE ACCPAC INTELLIGENCE

Tips and Tricks SAGE ACCPAC INTELLIGENCE Tips and Tricks SAGE ACCPAC INTELLIGENCE 1 Table of Contents Auto e-mailing reports... 4 Automatically Running Macros... 7 Creating new Macros from Excel... 8 Compact Metadata Functionality... 9 Copying,

More information

Creating Custom Crystal Reports Tutorial

Creating Custom Crystal Reports Tutorial Creating Custom Crystal Reports Tutorial 020812 2012 Blackbaud, Inc. This publication, or any part thereof, may not be reproduced or transmitted in any form or by any means, electronic, or mechanical,

More information

Preface of Excel Guide

Preface of Excel Guide Preface of Excel Guide The use of spreadsheets in a course designed primarily for business and social science majors can enhance the understanding of the underlying mathematical concepts. In addition,

More information

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ PharmaSUG 2014 PO10 Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ ABSTRACT As more and more organizations adapt to the SAS Enterprise Guide,

More information

Scientific Graphing in Excel 2010

Scientific Graphing in Excel 2010 Scientific Graphing in Excel 2010 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

Creating and Using Databases with Microsoft Access

Creating and Using Databases with Microsoft Access CHAPTER A Creating and Using Databases with Microsoft Access In this chapter, you will Use Access to explore a simple database Design and create a new database Create and use forms Create and use queries

More information

Gestation Period as a function of Lifespan

Gestation Period as a function of Lifespan This document will show a number of tricks that can be done in Minitab to make attractive graphs. We work first with the file X:\SOR\24\M\ANIMALS.MTP. This first picture was obtained through Graph Plot.

More information

ABSTRACT INTRODUCTION EXERCISE 1: EXPLORING THE USER INTERFACE GRAPH GALLERY

ABSTRACT INTRODUCTION EXERCISE 1: EXPLORING THE USER INTERFACE GRAPH GALLERY Statistical Graphics for Clinical Research Using ODS Graphics Designer Wei Cheng, Isis Pharmaceuticals, Inc., Carlsbad, CA Sanjay Matange, SAS Institute, Cary, NC ABSTRACT Statistical graphics play an

More information

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy University of Wisconsin-Extension Cooperative Extension Madison, Wisconsin PD &E Program Development & Evaluation Using Excel for Analyzing Survey Questionnaires Jennifer Leahy G3658-14 Introduction You

More information

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol R Graphics Cookbook Winston Chang Beijing Cambridge Farnham Koln Sebastopol O'REILLY Tokyo Table of Contents Preface ix 1. R Basics 1 1.1. Installing a Package 1 1.2. Loading a Package 2 1.3. Loading a

More information

Basic Excel Handbook

Basic Excel Handbook 2 5 2 7 1 1 0 4 3 9 8 1 Basic Excel Handbook Version 3.6 May 6, 2008 Contents Contents... 1 Part I: Background Information...3 About This Handbook... 4 Excel Terminology... 5 Excel Terminology (cont.)...

More information

How Does My TI-84 Do That

How Does My TI-84 Do That How Does My TI-84 Do That A guide to using the TI-84 for statistics Austin Peay State University Clarksville, Tennessee How Does My TI-84 Do That A guide to using the TI-84 for statistics Table of Contents

More information

Access 2003 Introduction to Queries

Access 2003 Introduction to Queries Access 2003 Introduction to Queries COPYRIGHT Copyright 1999 by EZ-REF Courseware, Laguna Beach, CA http://www.ezref.com/ All rights reserved. This publication, including the student manual, instructor's

More information

Beginner s Matlab Tutorial

Beginner s Matlab Tutorial Christopher Lum lum@u.washington.edu Introduction Beginner s Matlab Tutorial This document is designed to act as a tutorial for an individual who has had no prior experience with Matlab. For any questions

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Oracle Data Miner (Extension of SQL Developer 4.0)

Oracle Data Miner (Extension of SQL Developer 4.0) An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining

More information

Utilizing Microsoft Access Forms and Reports

Utilizing Microsoft Access Forms and Reports Utilizing Microsoft Access Forms and Reports The 2014 SAIR Conference Workshop #3 October 4 th, 2014 Presented by: Nathan Pitts (Sr. Research Analyst The University of North Alabama) Molly Vaughn (Associate

More information

VISUAL ALGEBRA FOR COLLEGE STUDENTS. Laurie J. Burton Western Oregon University

VISUAL ALGEBRA FOR COLLEGE STUDENTS. Laurie J. Burton Western Oregon University VISUAL ALGEBRA FOR COLLEGE STUDENTS Laurie J. Burton Western Oregon University VISUAL ALGEBRA FOR COLLEGE STUDENTS TABLE OF CONTENTS Welcome and Introduction 1 Chapter 1: INTEGERS AND INTEGER OPERATIONS

More information

Excel 2010: Create your first spreadsheet

Excel 2010: Create your first spreadsheet Excel 2010: Create your first spreadsheet Goals: After completing this course you will be able to: Create a new spreadsheet. Add, subtract, multiply, and divide in a spreadsheet. Enter and format column

More information

Excel Guide for Finite Mathematics and Applied Calculus

Excel Guide for Finite Mathematics and Applied Calculus Excel Guide for Finite Mathematics and Applied Calculus Revathi Narasimhan Kean University A technology guide to accompany Mathematical Applications, 6 th Edition Applied Calculus, 2 nd Edition Calculus:

More information

Excel 2007: Basics Learning Guide

Excel 2007: Basics Learning Guide Excel 2007: Basics Learning Guide Exploring Excel At first glance, the new Excel 2007 interface may seem a bit unsettling, with fat bands called Ribbons replacing cascading text menus and task bars. This

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

Using an Access Database

Using an Access Database A Few Terms Using an Access Database These words are used often in Access so you will want to become familiar with them before using the program and this tutorial. A database is a collection of related

More information

Formulas, Functions and Charts

Formulas, Functions and Charts Formulas, Functions and Charts :: 167 8 Formulas, Functions and Charts 8.1 INTRODUCTION In this leson you can enter formula and functions and perform mathematical calcualtions. You will also be able to

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures Introductory Statistics Lectures Visualizing Data Descriptive Statistics I Department of Mathematics Pima Community College Redistribution of this material is prohibited without written permission of the

More information

How To Use The Correlog With The Cpl Powerpoint Powerpoint Cpl.Org Powerpoint.Org (Powerpoint) Powerpoint (Powerplst) And Powerpoint 2 (Powerstation) (Powerpoints) (Operations

How To Use The Correlog With The Cpl Powerpoint Powerpoint Cpl.Org Powerpoint.Org (Powerpoint) Powerpoint (Powerplst) And Powerpoint 2 (Powerstation) (Powerpoints) (Operations orrelog SQL Table Monitor Adapter Users Manual http://www.correlog.com mailto:info@correlog.com CorreLog, SQL Table Monitor Users Manual Copyright 2008-2015, CorreLog, Inc. All rights reserved. No part

More information

R Language Fundamentals

R Language Fundamentals R Language Fundamentals Data Types and Basic Maniuplation Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Where did R come from? Overview Atomic Vectors Subsetting

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Importing and Exporting With SPSS for Windows 17 TUT 117

Importing and Exporting With SPSS for Windows 17 TUT 117 Information Systems Services Importing and Exporting With TUT 117 Version 2.0 (Nov 2009) Contents 1. Introduction... 3 1.1 Aim of this Document... 3 2. Importing Data from Other Sources... 3 2.1 Reading

More information

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable

More information

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011 Dataframes Lecture 8 Nicholas Christian BIOST 2094 Spring 2011 Outline 1. Importing and exporting data 2. Tools for preparing and cleaning datasets Sorting Duplicates First entry Merging Reshaping Missing

More information

Hypercosm. Studio. www.hypercosm.com

Hypercosm. Studio. www.hypercosm.com Hypercosm Studio www.hypercosm.com Hypercosm Studio Guide 3 Revision: November 2005 Copyright 2005 Hypercosm LLC All rights reserved. Hypercosm, OMAR, Hypercosm 3D Player, and Hypercosm Studio are trademarks

More information

Practical Example: Building Reports for Bugzilla

Practical Example: Building Reports for Bugzilla Practical Example: Building Reports for Bugzilla We have seen all the components of building reports with BIRT. By this time, we are now familiar with how to navigate the Eclipse BIRT Report Designer perspective,

More information

WEB TRADER USER MANUAL

WEB TRADER USER MANUAL WEB TRADER USER MANUAL Web Trader... 2 Getting Started... 4 Logging In... 5 The Workspace... 6 Main menu... 7 File... 7 Instruments... 8 View... 8 Quotes View... 9 Advanced View...11 Accounts View...11

More information

Webropol 2.0 Manual. Updated 5.7.2012

Webropol 2.0 Manual. Updated 5.7.2012 Webropol 2.0 Manual Updated 5.7.2012 Contents 1. GLOSSARY... 2 1.1. Question types... 2 1.2. Software Glossary... 3 1.3. Survey Glossary... 3 1.4. Reporting Glossary... 5 1.5. MyWebropol Glossary... 5

More information

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)

More information

Financial Econometrics MFE MATLAB Introduction. Kevin Sheppard University of Oxford

Financial Econometrics MFE MATLAB Introduction. Kevin Sheppard University of Oxford Financial Econometrics MFE MATLAB Introduction Kevin Sheppard University of Oxford October 21, 2013 2007-2013 Kevin Sheppard 2 Contents Introduction i 1 Getting Started 1 2 Basic Input and Operators 5

More information

Regression Clustering

Regression Clustering Chapter 449 Introduction This algorithm provides for clustering in the multiple regression setting in which you have a dependent variable Y and one or more independent variables, the X s. The algorithm

More information