Exploratory data analysis. An introduction to R for the language sciences


R. H. Baayen
Interfaculty research unit for language and speech, University of Nijmegen, & Max Planck Institute for Psycholinguistics, Nijmegen
baayen@mpi.nl

course materials, Helsinki, version December 17, 2004

Introduction

This book provides an introduction to the statistical analysis of quantitative data for researchers studying aspects of language and language processing. For many of my colleagues, the statistical analysis of quantitative data is an onerous task that they would rather leave to others. They tend to use statistical packages as a kind of oracle from which you elicit a verdict as to whether you have one or more significant effects in your data. In order to elicit a response from the oracle, one has to click one's way through cascades of menus. After a magic button press, voluminous output tends to be produced that hides the p-values, the ultimate goal of the statistical pilgrimage, among lots of other numbers that are completely meaningless to the user, as befits a true oracle.

The approach to data analysis to which this book provides a guide is fundamentally different in several ways. First of all, we will make use of a radically different tool for doing statistics, the interactive programming environment known as R. R is an open source implementation of the S language and environment for data analysis originally developed at Bell Laboratories. Learning to work with R is in many ways similar to learning a new language. Once you have mastered its grammar, and once you have acquired some basic vocabulary, you will also have begun to acquire a new way of thinking about data analysis that is essential for understanding the structure in your data. The design of R is especially elegant in that it has a consistent, uniform syntax for specifying statistical models, no matter which type of model is being fitted.

What is essential about working with R, and this brings us to the second difference in our approach, is that we will depend heavily on visualization. R has outstanding graphical facilities, which generally provide far more insight into the data than long lists of statistics that depend on often questionable simplifying assumptions. That is, this book provides an introduction to exploratory data analysis.

Moreover, we will work incrementally and interactively. R is an object-oriented programming language. If you are not familiar with this term, you can think of R as a language in which a statistical model is an object, an object created by the researcher to capture the structure in the data. Once the object is created, there are many things you can do with that object. You can summarize the object in order to inspect parameters and their p-values, or you can plot the object in order to see how well it fits the data. Or you can update the object, or extract predictions from

the object, and so on. The process of understanding the structure in your data is almost always an iterative process involving graphical inspection, model building, graphical inspection, updating and adjusting the model, and so on. The flexibility of R is crucial for making this iterative process both easy and enjoyable.

A third, at first sight heretical, aspect of this book is that we have avoided all formal maths. The focus of this book is on explaining the key concepts and on providing guidelines for the proper use of statistical techniques. A useful metaphor is learning to drive a car. In order to drive a car, you need to know the position and function of tools such as the steering wheel and the brake. You also need to know that you should not drive with the hand brake on. And you need to know the traffic rules. Without these three kinds of knowledge, driving a car is extremely dangerous. What you do not need to know is how to construct a combustion engine, or how to drill for oil and refine it so that you can use it to fuel a combustion engine. The aim of this book is to provide you with a driving licence for exploratory data analysis. There is one caveat here. To stretch the metaphor to its limit: with R, you are receiving driving lessons in an all-powerful car, a combination of a racing car, a lorry, a personal vehicle, and a limousine. Consequently, you have to be a responsible driver, which means that you will often find that you need additional driving lessons beyond those offered in this book. Moreover, it never hurts to consult professional drivers: statisticians with a solid background in mathematical statistics who know the ins and outs of the tools and techniques, and their advantages and disadvantages.

Finally, the approach we have taken in this course is to work with real data sets rather than with small artificial examples. Real data are often messy, and it is important to know how to proceed when the data display all kinds of problems that standard introductory textbooks hardly ever mention.

An important reason for using R is that it is a carefully designed programming environment that allows you, in a very flexible way, to write your own code, or modify existing code, to tailor R to your specific needs. Moreover, you can call R from another program (e.g., from scripting languages such as Python, Perl, or AWK), so that you do not have to carry out many similar analyses by hand, one by one. To see why this is useful, consider a researcher studying similarities in meaning and form for a large number of words. Suppose that a separate model needs to be fitted for each of 1000 words to the data of the other 999 words. If you are used to thinking about statistical questions as paths through cascaded menus, you will discard such an analysis as impractical almost immediately. When you work in R, it is a piece of cake, because you can write the code for one word, and then cycle it through all the other words (a sketch of such a loop is given below). If all the data are available at once, you can do this within R. If the data become available one chunk at a time, and if the joint data set is too large to load into R all at once, you can call R from another program to analyse the separate chunks. We have seen many instances of researchers being limited in the questions they explored because they were thinking in a menu-driven language instead of in an interactive programming language like R. This is an area where language determines thought.
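To give a flavour of what such an analysis looks like, here is a minimal sketch of the loop involved. The names simdata, Word, and fit.one.word() are placeholders, not part of any actual data set or package: they stand for your own data frame, the column identifying the words, and whatever model-fitting code is appropriate for your study.

words.to.fit = unique(as.character(simdata$Word))
results = list()                                # one fitted model per word
for (w in words.to.fit) {
    others = simdata[simdata$Word != w, ]       # the data of all other words
    results[[w]] = fit.one.word(others)         # fit and store a model (placeholder)
}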

If you are new to working with a programming language, you will find that you have to get used to getting your commands for R exactly right. Every comma, apostrophe, and bracket is important, and a single mistyped character will cause R to stop and respond with an error message. There are command-line editing facilities, and you can page through earlier commands with the up and down arrows of your keyboard. It is often more useful, however, to open a simple text editor (emacs, gvim, notepad), to prepare your commands in the editor, and to copy and paste finished commands into the R window. Especially the more complex commands tend to be used more than once, and it is often much easier to make copies in the editor and modify these than to try to edit multi-line commands in the R window itself. Output from R that is worth remembering can be pasted back into the editor, which in this way retains a detailed history of both your commands and the relevant results.

There are several ways in which you can use this book. If you use this book as an introduction to statistics, it is important to work through the examples, not only by reading them, but by trying them out in R. Each chapter also comes with a set of problems; the solutions to these problems are provided at the end of the book. If you use this book to learn how to apply in R particular techniques that you are already familiar with, then the quickest way to proceed is to study the structure of the relevant data files used to illustrate the technique. Once you have understood how the data are to be formatted, you can load the data into R and try out the example. Once you have got this working, it should not be difficult to try out the same technique on your own data.

This book is organized as follows. The first chapter describes the basics of the language, simple data structures, loading data, and exploring the structure in the data using various visualization techniques. The second chapter provides an introduction to random variables, distributions, and standard statistical tests for single random variables as well as tests for two random variables. Chapter 3 is an overview of exploratory techniques for clustering and classification. Chapter 4 introduces multiple regression, including analysis of (co)variance and multilevel modeling.


Chapter 1

Calculating and plotting in R

In order to learn to work with R, you have to learn to speak its language. The grammar of the R language is beautiful and easy to learn. It is important to master the basics of R's grammar, as this grammar is designed to help you think about your data in a way that points towards how you might want to analyse them. In this chapter, we begin with very simple examples that show you how to talk with R. As soon as possible, however, we will begin to use examples from a large experimental data set.

When you start R, for instance by typing R at the prompt of your Linux console, R responds by providing you with its own prompt (>). It also checks whether there is a file named .RData in your current directory. If there is no such file, it creates one. If there already is such a file, indicating that you have worked on a problem in this directory before, it will load that file and make the objects stored in that file available to you. It is advisable to keep different projects in different directories, in order to avoid your workspace becoming cluttered with lots of unrelated objects. When you work with large data sets and complex objects in R, this is also the way to keep your .RData file from becoming unmanageably large.

The way to learn a language is to start speaking it. The way to learn R is to use it. Reading through the examples in this chapter is not enough to become a confident user of R. For this, you need to actually try out the examples, by typing them at the R prompt. You have to be very precise in your commands, which requires a discipline that you will only master if you learn from experience, and learn from your mistakes. Don't be put off if R complains about your initial attempts to use it; just carefully compare what you typed, letter by letter and bracket by bracket, with the code in the examples.

1.1 Calculating with R

1.1.1 Numbers and strings

Once you have an R window, you can use R as an (overgrown) calculator, as shown in the following examples:

> 1 + 2        # addition
[1] 3
> 2 * 3        # multiplication
[1] 6
> 6 / 3        # division
[1] 2
> 2^3          # power
[1] 8
> sqrt(9)      # square root
[1] 3
> sqrt(9)^3
[1] 27

Note, first of all, that R provides the answer as soon as you hit the return key. There is no need to supply the equal sign, and in fact you should not try to supply one. Second, the answers to all these examples are preceded by a [1]. We will return to why this is shortly. Third, the output of one expression, sqrt(9), can serve immediately as the input of a second expression, ^3.

You can save the output of any calculation in variables such as x or y by using the equal sign =, the assignment operator. You can also use the variables in calculations in the same way as numbers:

> x = 3        # assignment
> x            # request to display value
[1] 3
> y = sqrt(16) # assignment
> y            # another request to display value
[1] 4
> x ^ y        # working with two variables
[1] 81

If you type the name of a variable, say x, to the R prompt, it returns the corresponding value, 3 in this example. It is also possible to store sequences of letters, technically known as strings, in a variable:

> w = "word"
> w
[1] "word"

Note that strings are enclosed between double quotes.
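Incidentally, R also accepts the arrow <- as an assignment operator. You will encounter it frequently in other R code; at the prompt it can be used interchangeably with the = used in this book:

> x <- 3       # the same assignment as x = 3
> x
[1] 3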

1.1.2 Vectors

In most of your work with R, you will not be dealing with single numbers, or single strings, but with groups of numbers, or groups of strings. For instance, you might want to add 1 to each of the numbers 1, 2, 3, and 4. The way to do this in R is to first combine these numbers into an ordered list, a vector, by means of the combination function c():

> x = c(1, 2, 3, 4)   # combining numbers into a vector
> x
[1] 1 2 3 4

Now that we have combined the numbers 1, 2, 3 and 4 into a vector, we can add 1 to each element as follows:

> x + 1               # add one to each vector element
[1] 2 3 4 5
> y = x + 1           # store result in y
> y                   # display y
[1] 2 3 4 5

Vectors are very useful when the same calculations have to be carried out on many pairs of numbers:

> x = c(1, 2, 3, 4)
> y = c(5, 6, 7, 8)
> x + y
[1]  6  8 10 12
> x * y
[1]  5 12 21 32
> x^2
[1]  1  4  9 16
> x^2 * y
[1]   5  24  63 128

In these examples, you can conceptualize x and y as two columns of numbers. Calculations are performed on the numbers in these columns that are on the same row. (It will be helpful to think of vectors as columns rather than as rows for many tools in R.)

Vectors are fundamental to most of what you will be doing with R, and we will therefore discuss a number of ways for creating vectors, and for accessing the elements of a vector. The operator : creates an ascending or descending sequence of whole numbers (integers):

> 1:4
[1] 1 2 3 4
> 4:1
[1] 4 3 2 1
> c(1:4, 4:1)
[1] 1 2 3 4 4 3 2 1

A more flexible sequencing function is seq(), which gives you control over the increment:

> seq(1, 2, 0.1)
 [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
> seq(2, 0, -0.5)
[1] 2.0 1.5 1.0 0.5 0.0

The function rep() is useful for repeating single numbers but also vectors:

> rep(1, 4)
[1] 1 1 1 1
> rep(1:4, 4)
 [1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
> rep(1:4, 4:1)
 [1] 1 1 1 1 2 2 2 3 3 4
> rep(seq(1, 4, 0.5), 2)
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 1.0 1.5 2.0 2.5 3.0 3.5 4.0

Note that the second argument of rep() specifies the number of repetitions. If this second argument is itself a vector, it specifies the number of repetitions required for each of the elements of the first argument, which should then itself be a vector.

In the examples thus far, we have only examined vectors whose elements were numbers. However, vectors can also be constructed for strings, and the functions c() and rep() work just as before:

> determiners = c("the", "an", "a")
> determiners
[1] "the" "an"  "a"
> abbreviations = c("1st", "2nd", "3rd")
> abbreviations
[1] "1st" "2nd" "3rd"
> c(determiners, abbreviations)
[1] "the" "an"  "a"   "1st" "2nd" "3rd"
> rep(determiners, 3:1)
[1] "the" "the" "the" "an"  "an"  "a"

When presented with a vector, it is often necessary to access specific elements from that vector. This is done by means of a mechanism called subscripting. The position of the element to be extracted from the vector is added after the vector name between square brackets. When more than one element needs to be extracted, a vector of positions can be used instead of a single number. Here are some examples that illustrate the key principles.

> determiners[1]
[1] "the"
> determiners[2]
[1] "an"

> determiners[c(1, 3)]
[1] "the" "a"
> determiners[3:1]
[1] "a"   "an"  "the"

It is also possible to subscript with a condition that has to be met:

> words = c("the", "cat", "sat", "on", "the", "mat")
> words[words == "the"]          # show the elements equal to "the"
[1] "the" "the"
> which(words == "the")          # show the positions of these elements
[1] 1 5
> words[which(words == "the")]
[1] "the" "the"
> words == "the"                 # a vector of booleans
[1]  TRUE FALSE FALSE FALSE  TRUE FALSE
> booleans = words == "the"
> words[booleans]
[1] "the" "the"

When you subscript a vector with a boolean vector (which should be of the same length), the result is a vector with those elements that correspond to the elements with the value TRUE in the boolean vector. The function which() is available for extracting the indexes of those elements in a vector that meet a given condition. Note that a double equal sign, ==, denotes equality, while a single equal sign is the assignment operator.

The function length() returns the number of elements in a vector, so if you need to access the last element in a vector, or the one but last, you can proceed as follows:

> words[length(words)]
[1] "mat"
> words[length(words) - 1]
[1] "the"
> words[(length(words) - 2) : (length(words) - 1)]
[1] "on"  "the"

Note that the left and right arguments of the : operator are enclosed in parentheses. This is necessary because this operator has high precedence, and comes into effect before the minus operator. Compare:

> (length(words) - 2) : (length(words) - 1)
[1] 4 5
> length(words) - 2 : length(words) - 1
[1]  3  2  1  0 -1

In the second case, the vector

> 2 : length(words)
[1] 2 3 4 5 6

is created first. This vector is then subtracted from length(words), which is automatically expanded to a vector of five sixes. From the resulting vector,

> length(words) - 2 : length(words)
[1] 4 3 2 1 0

a one is subtracted to give the final result.

When an operation is carried out on two vectors that do not have the same length, the shorter one is recycled until it has the same length as the longer vector:

> v1 = c(1, 2, 3, 4)
> v2 = c(5, 6)
> v1 * v2
[1]  5 12 15 24

If you want the elements of your vector to be sorted, use sort(). To reverse the order of the elements, use rev(). For a random reordering of the elements of a vector, a permutation, there is sample():

> sort(words)
[1] "cat" "mat" "on"  "sat" "the" "the"
> sort(words)[length(words):1]
[1] "the" "the" "sat" "on"  "mat" "cat"
> rev(sort(words))
[1] "the" "the" "sat" "on"  "mat" "cat"
> numbers = c(1, 3, 5, 7, 11, 13, 17, 19)
> sample(numbers)       # a random permutation of numbers
[1] ...
> numbers = sample(numbers)
> numbers
[1] ...
> sort(numbers)
[1]  1  3  5  7 11 13 17 19

The function unique() removes repeated entries in a vector:

> unique(words)
[1] "the" "cat" "sat" "on"  "mat"
> z = rep(numbers, 1:8)
> z
 [1] ...
[19] ...
> unique(z)

[1] ...
> sort(unique(z))
[1]  1  3  5  7 11 13 17 19

Note that when we ask R to show the contents of the variable z, we get two lines of output. The first line is preceded by [1], indicating that the first number listed on this line is the first element of z. The second line is preceded by [19], which tells you that the next element of z is the 19th element of this vector.

Finally, the table() function tabulates the frequencies with which the items of a vector occur:

> table(words)
words
cat mat  on sat the 
  1   1   1   1   2 
> table(z)
z
...
> z.table = table(z)
> z.table
z
...

The output of the table() function is displayed on three successive lines. First of all, it lists the name of the object for which a table is calculated. In the above examples, these objects were the vectors words and z. On the next line, we find the elements of these vectors, and on the third line, the counts of how often these elements occurred in these vectors.

1.1.3 Objects

The S language on which R is based is an object-oriented language. Everything that exists in R is an object that has specific methods associated with it. There are different types of objects, and each type of object has its own methods. One object type that we encountered above is the vector. One of the methods associated with a vector is the print method. For vectors, the print method is simple: it displays the contents of the vector, adding the position in the vector of the first element on each new line in the R window.

Functions are also objects. Consider, for instance, the function ls(), which lists the contents of your current workspace:

> ls()
[1] "abbreviations" "determiners"   "numbers"       "words"
[5] "x"             "y"             "z"             "z.table"

As ls() is itself an object, it has a print method. For functions, the print method displays the function code. Therefore, if you type ls to the prompt without the parentheses, you ask R to print the object on the screen. The code for ls() is too long to repeat here; instead, we show the code for the command to quit R, q():

> q
function (save = "default", status = 0, runLast = TRUE)
.Internal(quit(save, status, runLast))
<environment: namespace:base>
> q()
Save workspace image? [y/n/c]:

If you want to know more details about a function, the on-line help is very useful: just type a question mark followed by the function name. Type ?q to the prompt in order to see the details of what you can do with q(). When invoked as a function, q() will ask whether the workspace image should be saved. If you respond with yes, the objects in your workspace will be available the next time you start up R with that workspace.

Another type of object is produced by the table() function. Let's have a closer look at the table for our words vector. The function names() extracts the names of the elements that have been counted, and the counts themselves, without their names, can be extracted with the function as.numeric(). And you can access the count associated with a name by subscripting with the relevant name.

> words.table = table(words)
> words.table
words
cat mat  on sat the 
  1   1   1   1   2 
> names(words.table)
[1] "cat" "mat" "on"  "sat" "the"
> as.numeric(words.table)
[1] 1 1 1 1 2
> words.table["the"]
the 
  2 
> words.table[c("the", "cat")]
words
the cat 
  2   1 

It is crucial to keep in mind that subscripting by name always requires double quotes: the name is a string, and should be marked as such. Make sure you understand the following examples.

> z.table
z
...

> z.table[5]
11 
 5 
> z.table["11"]
11 
 5 
> z.table["5"]
5 
4 

If you subscript with the number 5, you ask for the fifth element, which has the label 11 and the value 5. It is often much clearer to address the table by the name itself, in which case you need double quotes. So with z.table["11"] you ask for the count of elevens in the vector.

Up till now, we have created vectors within R. In order to load already existing data files with numbers or strings into R, there is the function scan():

> n = scan(file = "DATA/numbers.txt")
Read 12 items
> w = scan(file = "DATA/words.txt", what = "character")
Read 8 items
> n
 [1] ...
> w
[1] "an"      "example" "of"      "a"       "file"
[6] "with"    "some"    "words"

Here is a summary of the functions and operators discussed thus far. Make sure you are confident about what they do before you read on.

arithmetic                    +  -  *  /  ^  sqrt()
creating vectors              :  c()  seq()  rep()
loading vectors               scan()
reordering vector elements    rev()  sort()  sample()
summarizing vectors           length()  unique()  table()
vector indexes                which()
table objects                 names(), as.numeric()
general                       q(), ls()
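As a brief illustration of how these building blocks combine, the following lines (a small sketch, using only functions from this summary and the words.table object created above) extract the most frequently occurring element of the words vector:

> sort(words.table)[length(words.table)]           # the largest count, with its name
the 
  2 
> names(sort(words.table))[length(words.table)]    # just the name
[1] "the"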

1.1.4 Matrices

Just as it is often quite handy to bring numbers together in a vector, it is often the case that we will need to bring vectors together into tables. The functions cbind() and rbind() bind vectors by column and by row respectively:

> a = c(1, 2, 3, 4)
> b = c(5, 6, 7, 8)
> c = c(9, 10, 11, 12)
> A = cbind(a, b, c)
> A
     a b  c
[1,] 1 5  9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
> B = rbind(a, b, c)
> B
  [,1] [,2] [,3] [,4]
a    1    2    3    4
b    5    6    7    8
c    9   10   11   12

Tables of numbers such as A and B are referred to as matrices. Note that when you use cbind(), the names of the vectors appear as column labels, while for rbind(), they appear as row labels.

As with vectors, we will often need to access specific elements, or rows, or columns of a matrix. To do so, we use the same subscripting mechanism with square brackets, but we now separate rows and columns by means of a comma. Information preceding the comma pertains to rows, information following the comma concerns columns. Have a look at these examples of subscripting:

> A[1, ]         # select the first row
a b c 
1 5 9 
> A[2, ]         # select the second row
 a  b  c 
 2  6 10 
> A[, "c"]       # select the column labelled "c"
[1]  9 10 11 12
> A[, 3]         # select the third column
[1]  9 10 11 12
> B["b", ]       # select the row labelled "b"
[1] 5 6 7 8

> B[, 2]         # select the second column
 a  b  c 
 2  6 10 
> C = B[1:2, 4:2]
> C
  [,1] [,2] [,3]
a    4    3    2
b    8    7    6

Note that the matrix C inherited the row labels from matrix B. The dimensions of a matrix can be queried with the function dim():

> dim(C)
[1] 2 3

What dim() returns is a vector with two elements specifying the number of rows and the number of columns.

When we created the matrices A and B, we did so by first creating individual vectors, which we then combined. When you have a vector that you want to reformat into a matrix, you can use the matrix() function:

> n                        # we made this vector above
 [1] ...
> D = matrix(n, 6, 2)      # a matrix of 6 rows and 2 columns
> D
     [,1] [,2]
[1,]    2   14
[2,]    4   16
[3,]    6   17
[4,]    8   18
[5,]  ...  ...
[6,]  ...  ...
> E = matrix(n, 4, 3)      # a matrix of 4 rows and 3 columns
> E
     [,1] [,2] [,3]
 ...
> F = matrix(0, 2, 2)      # a 2 by 2 matrix of zeros
> F
     [,1] [,2]
[1,]    0    0
[2,]    0    0
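The boolean subscripting that we introduced for vectors can also be used to select rows or columns of a matrix. As a brief aside, building on the matrix A created above, the following selects the rows for which the value in column a exceeds 2:

> A[A[, "a"] > 2, ]
     a b  c
[1,] 3 7 11
[2,] 4 8 12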

You can add or subtract matrices that have the same dimensions. If you multiply a matrix by a number, each element of the matrix is multiplied by that number. The same principle applies to arithmetic functions such as sqrt() or log():

> E * 2
     [,1] [,2] [,3]
 ...
> E + 2 * E
     [,1] [,2] [,3]
 ...
> log(E + 1)               # natural logarithm
     [,1] [,2] [,3]
 ...

Vectors of strings can also be combined into tables, but these are referred to as arrays and not as matrices:

> X = rbind(c("this", "is"), c("an", "array"), c("and", "not"),
+   c("a", "matrix"))
> X
     [,1]   [,2]
[1,] "this" "is"
[2,] "an"   "array"
[3,] "and"  "not"
[4,] "a"    "matrix"
> X[3, 1]                  # extract the 1st element on the 3rd row
[1] "and"

The plus sign on the second line of this example is the prompt that R gives instead of > when the command on the previous line is not complete.

1.1.5 Data frames

The elements of an array and a matrix should all be of the same type. When you try to combine vectors of different types, one of the vectors will be converted to the type of the other, as in the following example:

> words.table              # we made this table above
words
cat mat  on sat the 
  1   1   1   1   2 
> wrong = cbind(names(words.table), as.numeric(words.table))
> wrong
     [,1]  [,2]
[1,] "cat" "1" 
[2,] "mat" "1" 
[3,] "on"  "1" 
[4,] "sat" "1" 
[5,] "the" "2" 
> rm(wrong)                # delete wrong from the workspace

The second column of wrong is not a vector of numbers, but a vector of strings, so we cannot do any numerical operations on this vector any more. Fortunately, R provides a special kind of table in which you can bring together vectors of different types. This data type is known as a data frame. Here is how you can create a data frame to replace wrong:

> right = data.frame(words = names(words.table),
+   frequency = as.numeric(words.table))
> right
  words frequency
1   cat         1
2   mat         1
3    on         1
4   sat         1
5   the         2

We supplied two vectors to the data.frame() function, which we named words and frequency. These names appear as the column labels of the data frame right. There are three ways in which you can access the columns of a data frame, and two ways to access its rows.

> right[, 2]               # the second column
[1] 1 1 1 1 2
> right[, "frequency"]     # the column labelled "frequency"
[1] 1 1 1 1 2
> right$frequency          # the $ operator saves typing
[1] 1 1 1 1 2
> right["1", ]             # the row with the rowname "1"
  words frequency
1   cat         1
> right[1, ]               # the first row
  words frequency
1   cat         1
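Row and column subscripts can also be combined to pick out a single cell; a brief aside:

> right[3, "frequency"]    # the cell in row 3 of the column "frequency"
[1] 1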

New is the $ operator, which provides a convenient way of addressing columns. Data frames have both row names and column names, which can be extracted with the functions rownames() and colnames():

> rownames(right)
[1] "1" "2" "3" "4" "5"
> colnames(right)
[1] "words"     "frequency"

The outputs of rownames() and colnames() are themselves vectors, and the lengths of these vectors are identical to the dimensions returned by dim():

> length(rownames(right)) == dim(right)[1]
[1] TRUE
> length(colnames(right)) == dim(right)[2]
[1] TRUE

In order to illustrate the use of data frames, let's consider a real example, a data set studied by Baayen and Hay [2004]. Baayen and Hay were interested in the extent to which nonlinguistic cognition is affected by one's language. More specifically, they investigated to what extent one's knowledge of the names for objects and the lexical properties of these names influence the way we think about these objects. They addressed this question by means of an experiment with 81 concrete words, names for animals as well as fruits, nuts and vegetables. They asked 20 subjects to indicate, for each of these words, on a seven-point scale how heavy they thought the word's referent was. The result is a data set with 20 * 81 = 1620 subjective weight estimates.

The experimental data are available in the DATA directory as weight.ratings.txt. This file has 1620 lines, one for each rating elicited for a given subject and a given word. A given line also lists the sex of the subject, as well as many different lexical variables. We load this data file into R with read.table():

> weight = read.table("DATA/weight.ratings.txt", header = TRUE)

The option header = TRUE specifies that the first line in weight.ratings.txt is a header that specifies the names for the different columns. This option can be abbreviated to T. It makes no sense to type weight to the R prompt, as weight is a very large object, as can be seen with dim():

> dim(weight)
[1] 1620   23

It is much more informative to inspect, say, the first four lines,

> weight[1:4, ]
  Subject Rating Trial Sex     Word Frequency FamilySize
1      A1      5     1   F    horse       ...        ...
2      A1      1     2   F  gherkin       ...        ...
3      A1      3     3   F hedgehog       ...        ...
4      A1      1     4   F      bee       ...        ...
  SynsetCount Length  Class FreqSingular FreqPlural
1         ...      5 animal          ...        ...
2         ...      7  plant          ...        ...
3         ...      8 animal          ...        ...
4         ...      3 animal          ...        ...
  DerivEntropy Complex rinfl meanrt SubjFreq meansize
1          ... simplex   ...    ...      ...      ...
2          ... simplex   ...    ...      ...      ...
3          ... simplex   ...    ...      ...      ...
4          ... simplex   ...    ...      ...      ...
  BNCw BNCc BNCd BNCcRatio BNCdRatio
1  ...  ...  ...       ...       ...
2  ...  ...  ...       ...       ...
3  ...  ...  ...       ...       ...
4  ...  ...  ...       ...       ...

or to ask for the column names:

> colnames(weight)
 [1] "Subject"      "Rating"       "Trial"
 [4] "Sex"          "Word"         "Frequency"
 [7] "FamilySize"   "SynsetCount"  "Length"
[10] "Class"        "FreqSingular" "FreqPlural"
[13] "DerivEntropy" "Complex"      "rinfl"
[16] "meanrt"       "SubjFreq"     "meansize"
[19] "BNCw"         "BNCc"         "BNCd"
[22] "BNCcRatio"    "BNCdRatio"

The first column lists the (anonymized) subjects in the experiment, each of whom contributed 81 ratings:

> table(weight$Subject)

A1 A2  G  H I1 I2  J  K  L M1 M2  P R1 R2 R3 R4 R5 S1 S2 T1 
81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 81 

In order to obtain just the names of the different subjects, we have two options:

> unique(weight$Subject)

 [1] A1 A2 G  H  I1 I2 J  K  L  M1 M2 P  R1 R2 R3 R4 R5 S1
[19] S2 T1
20 Levels: A1 A2 G H I1 I2 J K L M1 M2 P R1 R2 R3 R4 ... T1
> levels(weight$Subject)
 [1] "A1" "A2" "G"  "H"  "I1" "I2" "J"  "K"  "L"  "M1" "M2"
[12] "P"  "R1" "R2" "R3" "R4" "R5" "S1" "S2" "T1"

Note that when we apply unique() to the column of the data frame listing the subjects, we not only obtain the names of the subjects, but also some additional information, namely, that there are twenty levels, which are subsequently listed in summary form. The reason for this additional information is that string vectors in a data frame are converted automatically to factors. A factor in statistics is a variable that has strings as its possible values. Another factor in the weight data frame is Sex, which has as its levels F (female) and M (male). The distinction between a string vector and a factor is crucial for many of the tools that we will be using in later chapters. The simplest way to see what subjects participated is to use the function that simply returns the levels of a factor, levels(). A related summary function is nlevels(), which returns the number of levels:

> nlevels(weight$Subject)
[1] 20

The column labelled Word specifies the word for which a rating was elicited. We extract a list of these words with levels():

> levels(weight$Word)
 [1] "almond"     "ant"        "apple"      "apricot"
 [5] "asparagus"  "avocado"    "badger"     "banana"
 [9] "bat"        "beaver"     "bee"        "beetroot"
[13] "blackberry" "blueberry"  "broccoli"   "bunny"
[17] "butterfly"  "camel"      "carrot"     "cat"
[21] "cherry"     "chicken"    "clove"      "crocodile"
[25] "cucumber"   "dog"        "dolphin"    "donkey"
[29] "eagle"      "eggplant"   "elephant"   "fox"
[33] "frog"       "gherkin"    "goat"       "goose"
[37] "grape"      "gull"       "hedgehog"   "horse"
[41] "kiwi"       "leek"       "lemon"      "lettuce"
[45] "lion"       "magpie"     "melon"      "mole"
[49] "monkey"     "moose"      "mouse"      "mushroom"
[53] "mustard"    "olive"      "orange"     "owl"
[57] "paprika"    "peanut"     "pear"       "pig"
[61] "pigeon"     "pineapple"  "potato"     "radish"
[65] "reindeer"   "shark"      "sheep"      "snake"
[69] "spider"     "squid"      "squirrel"   "stork"
[73] "strawberry" "swan"       "tomato"     "tortoise"

21 [77] "vulture" "walnut" "wasp" "whale" [81] "woodpecker" The third column of weight lists the trial number, if a word has trial number 4, it was the fourth word in the experiment that a given subject was asked to rate. This variable ranges from 1 to 81, and is a control variable that allows us to trace possible effects of learing or fatigue that might take place in the course of the experiment. The remaining 18 columns specify various properties of the words. These are briefly described in the appendix. Variables of interest for the present example are a word s frequency (Frequency), its family size (the number of complex words in which it appears as a constituent, FamilySize), the number of synonym sets (synsets) in which it is listed in WordNet [Miller, 1990, Beckwith et al., 1991, Fellbaum, 1998], its length in letters (Length), its Class (plant or animal), and its derivational entropy, a token-weighted variant of the family size count, [Moscoso del Prado Martín et al., 2004]. All lexical variables for a given word are repeated on twenty rows of weight, once for each subject. In order to obtain a data frame that is restricted to the information pertaining only to the items, we use the unique() function as follows: > items = unique(weight[, 5:23]) # skip columns with information > dim(items) # specific to subject and trial [1] The first four columns of weight contain information about the subject and the trial itself. This is information we want to discard in order to obtain a data frame that summarizes the properties of the words in the experiment. From column 5 onwards, we have the information that is specific to the items. The subsection of the data frame obtained by weight[, 5:23] has 81 unique lines, each of which is repeated 20 times, once for each subject. With unique, we remove the redundant lines, and retain exactly one instance of each unique line. This data frame still has more columns then we need at this moment. Let s consider how we can create a smaller data frame with exactly the relevant information: > items = items[,c(1:6,9)] # or > items = items[,c("word", "Frequency", "FamilySize", + "SynsetCount", "Length", "Class", "DerivEntropy")] > items[1:4,] Word Frequency FamilySize SynsetCount Length Class 1 horse animal 2 gherkin plant 3 hedgehog animal 4 bee animal DerivEntropy

The first two commands select exactly the same subset of columns. The first uses the column numbers, the second the column names.

It is sometimes convenient to relabel column names or row names. For instance, we can replace the row names by the words,

> rownames(items) = as.character(items$Word)
> items[1:4, 1:6]
             Word Frequency FamilySize SynsetCount Length  Class
horse       horse       ...        ...         ...      5 animal
gherkin   gherkin       ...        ...         ...      7  plant
hedgehog hedgehog       ...        ...         ...      8 animal
bee           bee       ...        ...         ...      3 animal

which makes it easy to extract information from the data frame by name:

> items["bat", "Frequency"]
[1] ...
> items["pig", "Length"]
[1] 3

The function as.character() used above converts the factor items$Word to a vector of strings. This conversion is necessary because the row names and the column names are vectors of strings and not factors.

It is very important to become fluent in subscripting data frames. You should keep in mind that restrictions on rows precede the comma, and restrictions on columns follow the comma in the subscripting sequence [ , ]. Here are some examples:

> items[items$Length == 3, 2:4]
    Frequency FamilySize SynsetCount
bee       ...        ...         ...
pig       ...        ...         ...
fox       ...        ...         ...
bat       ...        ...         ...
dog       ...        ...         ...
owl       ...        ...         ...
cat       ...        ...         ...
ant       ...        ...         ...
> items2 = items[items$Length > 5 & items$Length < 7, c(1, 3)][1:5, ]
> items2
         Word FamilySize
peanut peanut        ...
pigeon pigeon        ...
tomato tomato        ...

donkey donkey        ...
magpie magpie        ...
> items[items$Length > 7 | items$Length == 3,
+   c("Word", "FamilySize")][1:5, ]
                 Word FamilySize
hedgehog     hedgehog        ...
bee               bee        ...
pineapple   pineapple        ...
blackberry blackberry        ...
tortoise     tortoise        ...

The second and third examples illustrate the logical connectives and (&) and or (|). They also illustrate that you can subscript the part of the data frame that you have just subscripted. After all, the result of subscripting a data frame is a new, smaller data frame, which can in turn be subscripted. Subscripting with [1:5, ] is a convenient way of inspecting the first couple of lines of a data frame.

You sort the lines of a data frame with the function order():

> items2[order(items2$Word), ]
         Word FamilySize
donkey donkey        ...
magpie magpie        ...
peanut peanut        ...
pigeon pigeon        ...
tomato tomato        ...

When you call order() with items2$Word as argument, it returns a vector with the row numbers of the words such that the words themselves are sorted:

> order(items2$Word)
[1] 4 5 1 2 3

When this vector is inserted in the row slot of the subscript of items2, its rows are rearranged in this order. When order() is supplied with more than one argument, it will sort on the first argument, and resolve ties by looking at the second argument, or the third, if required, and so on.

> items2[order(items2$FamilySize, items2$Word), ]
         Word FamilySize
donkey donkey        ...
magpie magpie        ...
tomato tomato        ...
peanut peanut        ...
pigeon pigeon        ...
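To sort in descending order, you can reverse the ordering vector with rev(), which we encountered earlier; a brief aside:

> rev(order(items2$Word))
[1] 3 2 1 5 4
> items2[rev(order(items2$Word)), ]   # rows from tomato down to donkey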

1.1.6 Random variables

Thus far, we have encountered two kinds of vectors: numerical vectors and factors. Both are used to represent random variables. A random variable is the outcome of an experiment. Here are some examples of experiments and their associated random variables:

tossing a coin: a random variable with values head or tail.

throwing a die: a random variable with values 1, 2, ..., 6.

counting words: a random variable with as its values the frequency of occurrence in some corpus: 0, 1, 2, ..., N (with N the size of the corpus).

familiarity rating: the subjective estimate of frequency, usually on a scale of 1 to 7, is the random variable in this experiment.

lexical decision: this kind of experiment has two associated random variables: the accuracy of a response (with levels correct and incorrect) and the latency of the response (in milliseconds).

A random variable is random in the sense that the outcome of a given experiment is not fully predictable. The opposite of a random variable is a constant. The size of a given corpus such as the Brown corpus [Kučera and Francis, 1967] in word tokens is fixed; hence by itself this specific corpus size is not a random variable. On the other hand, corpus size, defined as ranging over many different corpora, is a random variable, because we cannot say what the corpus size is without being told what corpus we are dealing with. The art of statistics is to learn from prior experience with a given random variable (or sets of random variables) in order to optimize one's predictions as to what the most likely value of a random variable is.
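Random variables of this kind are easy to simulate in R. As a small sketch using the sample() function introduced earlier (the extra argument replace = TRUE allows the same outcome to be drawn repeatedly; the results will differ from run to run):

> tosses = sample(c("head", "tail"), 20, replace = TRUE)
> tosses          # twenty simulated coin tosses
> table(tosses)   # how often each outcome occurred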

1.1.7 Summary

Before starting with the section on visualization, make sure you are confident about the use of the functions that were introduced in this section.

creating tables      cbind()  rbind()  matrix()
data frames          $  data.frame()  rownames()  colnames()  read.table()
properties           dim()
sorting              order()
selecting            unique()
type conversion      as.character()
general              rm()

1.2 Visualization

An important first step in exploratory data analysis is to inspect your data graphically. It is difficult and often downright impossible to make sense of large tables of numbers. But patterns in the data often become visible thanks to the tools for data visualization that are now available. We first discuss tools for visualizing properties of single random variables (in vectors and one-dimensional tables), and then proceed with an overview of tools for graphing groups of random variables (typically brought together in matrices or data frames).

1.2.1 Visualizing single random variables

Bar plots and histograms are useful for obtaining visual summaries of the distributions of random variables. Figure 1.1 illustrates this for the numeric variables in the items data frame that describes the main properties of the words used in the rating experiment eliciting subjective estimates of the referent's weight. This figure has six panels arranged in a matrix of three rows and two columns. In order to instruct R to make such a matrix of plots, we have to set the appropriate graphics parameter, mfrow, to the vector c(3, 2) using the function par(), which controls a large number of graphical parameters. Plots will be added one by one to the plot region, proceeding row by row from left to right.

par(mfrow = c(3, 2))

The upper left panel is a bar plot of the counts of word lengths:

barplot(table(items$Length), xlab = "word length", col = "grey")

The option xlab sets the label for the X axis, and with the option col we set the color of the bars to grey. We see that word lengths range from 3 to 10, and that the distribution is somewhat asymmetric, with a mode (the value observed most often) at 5. The mean is 5.9, and the median is 6. (The median is obtained by ordering the observations from small to large, and then taking the value for which 50% of the data points are smaller.) Mean, median, and range are easy to extract with the corresponding functions mean(), median(), and range():

> mean(items$Length)
[1] 5.9
> median(items$Length)
[1] 6
> range(items$Length)
[1]  3 10

There are also separate functions for extracting the minimum and the maximum:

> min(items$Length)
[1] 3
> max(items$Length)
[1] 10

The upper right panel of Figure 1.1 shows the histogram corresponding to the bar plot in the upper left panel. The main difference between the bar plot and the histogram is that the latter is scaled on the vertical axis in such a way that the total area of the bars is equal to 1. This allows us to see that the probability of the word lengths 5 and 6 jointly is close to 0.5. This histogram was produced with the truehist() function in the MASS library of Venables and Ripley [2003], which is part of any recent distribution of R. In order to access this function, we need to load this library with the library() function,

library(MASS)

after which we can produce the histogram in the upper right panel with

truehist(items$Length, xlab = "word length", col = "grey")

The remaining panels of Figure 1.1 were made in the same way:

truehist(items$Frequency, xlab = "log word frequency", col = "grey")
truehist(items$SynsetCount, xlab = "log synset count", col = "grey")
truehist(items$FamilySize, xlab = "log family size", col = "grey")
truehist(items$DerivEntropy, xlab = "derivational entropy", col = "grey")

Note that the bottom panels show highly skewed distributions: most of the words in this experiment have no morphological family members at all. Now that all panels have been filled, we reset the graphics parameter to one figure in the plot region:

par(mfrow = c(1, 1))

There are several ways in which plots can be saved as independent graphics files: as png or jpeg files, or as PostScript files. The corresponding functions are png(), jpeg(), and postscript(). We illustrate how these functions work for PostScript.

> postscript("barplot.ps", horizontal = FALSE, he = 6, wi = 6,
+   family = "Helvetica", paper = "special", onefile = FALSE)
> truehist(items$Frequency, xlab = "log word frequency")
> dev.off()

The first argument of postscript() is the name of the PostScript file to be created. Whether the plot should be in portrait or landscape mode is controlled by the horizontal argument. The parameters he and wi control the height and width of the plot in inches. The font to be used is specified by family, and with paper = "special" the output will be an encapsulated PostScript file that can easily be incorporated in, for instance, a LaTeX document.

Figure 1.1: A bar plot and histograms for the variables describing the lexical properties of the words used in the weight rating experiment. (Panels, from top left to bottom right: word length, word length, log word frequency, log synset count, log family size, derivational entropy.)

The final argument, onefile, is set to FALSE in order to indicate that there is only a single plot in the file. There are many more options; check the on-line help for further details.

The postscript() command opens a PostScript file, and all following plot commands are diverted to this PostScript file. To close the PostScript file, you use the function dev.off(). After this command, new plots will appear in the graphics window as usual. The dev.off() command is crucial: if you forget to close your file, you will run into all sorts of trouble when you try to view the file outside R, or when you try to make a new figure in R.

The shape of a histogram depends, sometimes to a surprising extent, on the width of the bars and on the position of the left side of the first bar. The function truehist() has defaults that are chosen to minimize the risk of obtaining a rather arbitrarily shaped histogram (see also Haerdle, 1991). A function that further reduces this risk is density(). We illustrate this function for the reaction times elicited in a visual lexical decision experiment using the same words as in the weight rating experiment. In a visual lexical decision experiment, words are presented on a computer screen together with non-existing words like sulp. Subjects are asked to indicate as quickly as possible, by means of two push buttons, whether the letter string presented on the screen is a real word. The time between the moment that the word is displayed on the screen and the moment at which a button response is recorded is the reaction time (also referred to as the response latency). It is a measure of the complexity of lexical processing, which is known to be co-determined by a wide range of lexical variables. The reaction times for 79 of the 81 words discussed above are available in the DATA directory as the text file lexdec.txt. For further details about the variables in this data set, see the appendix.

The upper left panel of Figure 1.2 shows the histogram as given by truehist() applied to the (logarithmically transformed) reaction times:

lexdec = read.table("DATA/lexdec.txt", T)
truehist(lexdec$RT, col = "lightgrey", xlab = "log RT")

The distribution of reaction times is somewhat skewed, with an extended right tail of very long latencies. The upper right panel of Figure 1.2 shows the histogram as produced by hist() instead of truehist(), together with the density curve. The two have roughly the same shape, but the density curve smoothes the discrete jumps of the histogram. The lower left panel uses hist(), but now with the same bin widths as truehist(). The histogram and the density curve are now very similar estimates of the distribution of reaction times.

Plotting the upper right and lower left panels requires some careful preparation in order to make sure that the ranges of values for the two axes are set properly to accommodate both the histogram and the density function. We therefore begin with the standard function for making a histogram, hist(), which we force to make the same bins as truehist() in the case of the lower left panel, by specifying the breaks (the points where new bins should begin) explicitly. Instead of plotting the histogram, we save it, so that we can extract the range of values for the horizontal and vertical axes.

> h = hist(lexdec$RT, freq = FALSE, plot = FALSE,

+   breaks = seq(5.8, 7.6, by = 0.1))      # lower left panel

Figure 1.2: Histograms and density function for the response latencies of 21 subjects to 79 nouns referring to animals and plants (fruits and vegetables). (All panels share the axis label log RT.)

We then repeat this procedure for the density curve,

> d = density(lexdec$RT)

and then set the X and Y limits:

> xlimit = range(h$breaks, range(d$x))
> ylimit = range(0, h$density, d$y)

Finally, we plot the histogram, and add the curve for the density with the function lines(). The function lines() takes a vector of x coordinates and a vector of y coordinates, and connects the points specified by these coordinates with a line (in the order specified by the input vectors).

> hist(lexdec$RT, freq = FALSE, col = "lightgrey",
+   border = "darkgrey", ylab = "", xlab = "log RT",
+   xlim = xlimit, ylim = ylimit, main = "",
+   breaks = seq(5.8, 7.6, by = 0.1))
> lines(d)

The border option of hist() controls the color of the lines marking the bars of the histogram. We prevent hist() from adding a main plot title by setting main to the empty string.
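If you want to save the resulting figure to a file, the same device mechanism applies as described above for PostScript. A minimal sketch using the png() device (the file name hist.png is only an example):

png("hist.png", width = 480, height = 480)   # open a PNG file
hist(lexdec$RT, freq = FALSE, col = "lightgrey",
     border = "darkgrey", ylab = "", xlab = "log RT",
     xlim = xlimit, ylim = ylimit, main = "",
     breaks = seq(5.8, 7.6, by = 0.1))
lines(d)                                     # add the density curve
dev.off()                                    # close the file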

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize

More information

Getting Started with R and RStudio 1

Getting Started with R and RStudio 1 Getting Started with R and RStudio 1 1 What is R? R is a system for statistical computation and graphics. It is the statistical system that is used in Mathematics 241, Engineering Statistics, for the following

More information

Drawing a histogram using Excel

Drawing a histogram using Excel Drawing a histogram using Excel STEP 1: Examine the data to decide how many class intervals you need and what the class boundaries should be. (In an assignment you may be told what class boundaries to

More information

Excel 2003 Tutorial I

Excel 2003 Tutorial I This tutorial was adapted from a tutorial by see its complete version at http://www.fgcu.edu/support/office2000/excel/index.html Excel 2003 Tutorial I Spreadsheet Basics Screen Layout Title bar Menu bar

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

Participant Guide RP301: Ad Hoc Business Intelligence Reporting

Participant Guide RP301: Ad Hoc Business Intelligence Reporting RP301: Ad Hoc Business Intelligence Reporting State of Kansas As of April 28, 2010 Final TABLE OF CONTENTS Course Overview... 4 Course Objectives... 4 Agenda... 4 Lesson 1: Reviewing the Data Warehouse...

More information

Ohio University Computer Services Center August, 2002 Crystal Reports Introduction Quick Reference Guide

Ohio University Computer Services Center August, 2002 Crystal Reports Introduction Quick Reference Guide Open Crystal Reports From the Windows Start menu choose Programs and then Crystal Reports. Creating a Blank Report Ohio University Computer Services Center August, 2002 Crystal Reports Introduction Quick

More information

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc.

STATGRAPHICS Online. Statistical Analysis and Data Visualization System. Revised 6/21/2012. Copyright 2012 by StatPoint Technologies, Inc. STATGRAPHICS Online Statistical Analysis and Data Visualization System Revised 6/21/2012 Copyright 2012 by StatPoint Technologies, Inc. All rights reserved. Table of Contents Introduction... 1 Chapter

More information

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE

OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE OVERVIEW OF R SOFTWARE AND PRACTICAL EXERCISE Hukum Chandra Indian Agricultural Statistics Research Institute, New Delhi-110012 1. INTRODUCTION R is a free software environment for statistical computing

More information

SPSS: Getting Started. For Windows

SPSS: Getting Started. For Windows For Windows Updated: August 2012 Table of Contents Section 1: Overview... 3 1.1 Introduction to SPSS Tutorials... 3 1.2 Introduction to SPSS... 3 1.3 Overview of SPSS for Windows... 3 Section 2: Entering

More information

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS SECTION 2-1: OVERVIEW Chapter 2 Describing, Exploring and Comparing Data 19 In this chapter, we will use the capabilities of Excel to help us look more carefully at sets of data. We can do this by re-organizing

More information

Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller

Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller Tutorial 2: Reading and Manipulating Files Jason Pienaar and Tom Miller Most of you want to use R to analyze data. However, while R does have a data editor, other programs such as excel are often better

More information

Microsoft Excel Tips & Tricks

Microsoft Excel Tips & Tricks Microsoft Excel Tips & Tricks Collaborative Programs Research & Evaluation TABLE OF CONTENTS Introduction page 2 Useful Functions page 2 Getting Started with Formulas page 2 Nested Formulas page 3 Copying

More information

Microsoft Access Basics

Microsoft Access Basics Microsoft Access Basics 2006 ipic Development Group, LLC Authored by James D Ballotti Microsoft, Access, Excel, Word, and Office are registered trademarks of the Microsoft Corporation Version 1 - Revision

More information

Creating Interactive PDF Forms

Creating Interactive PDF Forms Creating Interactive PDF Forms Using Adobe Acrobat X Pro Information Technology Services Outreach and Distance Learning Technologies Copyright 2012 KSU Department of Information Technology Services This

More information

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P.

G563 Quantitative Paleontology. SQL databases. An introduction. Department of Geological Sciences Indiana University. (c) 2012, P. SQL databases An introduction AMP: Apache, mysql, PHP This installations installs the Apache webserver, the PHP scripting language, and the mysql database on your computer: Apache: runs in the background

More information

Data exploration with Microsoft Excel: univariate analysis

Data exploration with Microsoft Excel: univariate analysis Data exploration with Microsoft Excel: univariate analysis Contents 1 Introduction... 1 2 Exploring a variable s frequency distribution... 2 3 Calculating measures of central tendency... 16 4 Calculating

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Microsoft Excel 2010 Part 3: Advanced Excel

Microsoft Excel 2010 Part 3: Advanced Excel CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES Microsoft Excel 2010 Part 3: Advanced Excel Winter 2015, Version 1.0 Table of Contents Introduction...2 Sorting Data...2 Sorting

More information

0 Introduction to Data Analysis Using an Excel Spreadsheet

0 Introduction to Data Analysis Using an Excel Spreadsheet Experiment 0 Introduction to Data Analysis Using an Excel Spreadsheet I. Purpose The purpose of this introductory lab is to teach you a few basic things about how to use an EXCEL 2010 spreadsheet to do

More information

Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors. Content of the Precalculus Subpackage

Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors. Content of the Precalculus Subpackage Classroom Tips and Techniques: The Student Precalculus Package - Commands and Tutors Robert J. Lopez Emeritus Professor of Mathematics and Maple Fellow Maplesoft This article provides a systematic exposition

More information

Figure 1. An embedded chart on a worksheet.

Figure 1. An embedded chart on a worksheet. 8. Excel Charts and Analysis ToolPak Charts, also known as graphs, have been an integral part of spreadsheets since the early days of Lotus 1-2-3. Charting features have improved significantly over the

More information

Graphics in R. Biostatistics 615/815

Graphics in R. Biostatistics 615/815 Graphics in R Biostatistics 615/815 Last Lecture Introduction to R Programming Controlling Loops Defining your own functions Today Introduction to Graphics in R Examples of commonly used graphics functions

More information

Exercise 4 Learning Python language fundamentals

Exercise 4 Learning Python language fundamentals Exercise 4 Learning Python language fundamentals Work with numbers Python can be used as a powerful calculator. Practicing math calculations in Python will help you not only perform these tasks, but also

More information

Module 2 Basic Data Management, Graphs, and Log-Files

Module 2 Basic Data Management, Graphs, and Log-Files AGRODEP Stata Training April 2013 Module 2 Basic Data Management, Graphs, and Log-Files Manuel Barron 1 and Pia Basurto 2 1 University of California, Berkeley, Department of Agricultural and Resource Economics

More information

Using R for Windows and Macintosh

Using R for Windows and Macintosh 2010 Using R for Windows and Macintosh R is the most commonly used statistical package among researchers in Statistics. It is freely distributed open source software. For detailed information about downloading

More information

Exploratory Data Analysis and Plotting

Exploratory Data Analysis and Plotting Exploratory Data Analysis and Plotting The purpose of this handout is to introduce you to working with and manipulating data in R, as well as how you can begin to create figures from the ground up. 1 Importing

More information

WHO STEPS Surveillance Support Materials. STEPS Epi Info Training Guide

WHO STEPS Surveillance Support Materials. STEPS Epi Info Training Guide STEPS Epi Info Training Guide Department of Chronic Diseases and Health Promotion World Health Organization 20 Avenue Appia, 1211 Geneva 27, Switzerland For further information: www.who.int/chp/steps WHO

More information

Psy 210 Conference Poster on Sex Differences in Car Accidents 10 Marks

Psy 210 Conference Poster on Sex Differences in Car Accidents 10 Marks Psy 210 Conference Poster on Sex Differences in Car Accidents 10 Marks Overview The purpose of this assignment is to compare the number of car accidents that men and women have. The goal is to determine

More information

Plotting: Customizing the Graph

Plotting: Customizing the Graph Plotting: Customizing the Graph Data Plots: General Tips Making a Data Plot Active Within a graph layer, only one data plot can be active. A data plot must be set active before you can use the Data Selector

More information

Microsoft Access 2010 Overview of Basics

Microsoft Access 2010 Overview of Basics Opening Screen Access 2010 launches with a window allowing you to: create a new database from a template; create a new template from scratch; or open an existing database. Open existing Templates Create

More information

Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions

Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions 2010 Using Excel for Data Manipulation and Statistical Analysis: How-to s and Cautions This document describes how to perform some basic statistical procedures in Microsoft Excel. Microsoft Excel is spreadsheet

More information

Intro to Excel spreadsheets

Intro to Excel spreadsheets Intro to Excel spreadsheets What are the objectives of this document? The objectives of document are: 1. Familiarize you with what a spreadsheet is, how it works, and what its capabilities are; 2. Using

More information

Microsoft Access 2010 Part 1: Introduction to Access

Microsoft Access 2010 Part 1: Introduction to Access CALIFORNIA STATE UNIVERSITY, LOS ANGELES INFORMATION TECHNOLOGY SERVICES Microsoft Access 2010 Part 1: Introduction to Access Fall 2014, Version 1.2 Table of Contents Introduction...3 Starting Access...3

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Netigate User Guide. Setup... 2. Introduction... 5. Questions... 6. Text box... 7. Text area... 9. Radio buttons...10. Radio buttons Weighted...

Netigate User Guide. Setup... 2. Introduction... 5. Questions... 6. Text box... 7. Text area... 9. Radio buttons...10. Radio buttons Weighted... Netigate User Guide Setup... 2 Introduction... 5 Questions... 6 Text box... 7 Text area... 9 Radio buttons...10 Radio buttons Weighted...12 Check box...13 Drop-down...15 Matrix...17 Matrix Weighted...18

More information

AMATH 352 Lecture 3 MATLAB Tutorial Starting MATLAB Entering Variables

AMATH 352 Lecture 3 MATLAB Tutorial Starting MATLAB Entering Variables AMATH 352 Lecture 3 MATLAB Tutorial MATLAB (short for MATrix LABoratory) is a very useful piece of software for numerical analysis. It provides an environment for computation and the visualization. Learning

More information

DataPA OpenAnalytics End User Training

DataPA OpenAnalytics End User Training DataPA OpenAnalytics End User Training DataPA End User Training Lesson 1 Course Overview DataPA Chapter 1 Course Overview Introduction This course covers the skills required to use DataPA OpenAnalytics

More information

InfiniteInsight 6.5 sp4

InfiniteInsight 6.5 sp4 End User Documentation Document Version: 1.0 2013-11-19 CUSTOMER InfiniteInsight 6.5 sp4 Toolkit User Guide Table of Contents Table of Contents About this Document 3 Common Steps 4 Selecting a Data Set...

More information

R: A self-learn tutorial

R: A self-learn tutorial R: A self-learn tutorial 1 Introduction R is a software language for carrying out complicated (and simple) statistical analyses. It includes routines for data summary and exploration, graphical presentation

More information

Part 1 Foundations of object orientation

Part 1 Foundations of object orientation OFWJ_C01.QXD 2/3/06 2:14 pm Page 1 Part 1 Foundations of object orientation OFWJ_C01.QXD 2/3/06 2:14 pm Page 2 1 OFWJ_C01.QXD 2/3/06 2:14 pm Page 3 CHAPTER 1 Objects and classes Main concepts discussed

More information

Using Excel for Statistical Analysis

Using Excel for Statistical Analysis 2010 Using Excel for Statistical Analysis Microsoft Excel is spreadsheet software that is used to store information in columns and rows, which can then be organized and/or processed. Excel is a powerful

More information

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

8 CREATING FORM WITH FORM WIZARD AND FORM DESIGNER

8 CREATING FORM WITH FORM WIZARD AND FORM DESIGNER 8 CREATING FORM WITH FORM WIZARD AND FORM DESIGNER 8.1 INTRODUCTION Forms are very powerful tool embedded in almost all the Database Management System. It provides the basic means for inputting data for

More information

While Loops and Animations

While Loops and Animations C h a p t e r 6 While Loops and Animations In this chapter, you will learn how to use the following AutoLISP functions to World Class standards: 1. The Advantage of Using While Loops and Animation Code

More information

Introduction to Matlab

Introduction to Matlab Introduction to Matlab Social Science Research Lab American University, Washington, D.C. Web. www.american.edu/provost/ctrl/pclabs.cfm Tel. x3862 Email. SSRL@American.edu Course Objective This course provides

More information

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class

Homework 4 Statistics W4240: Data Mining Columbia University Due Tuesday, October 29 in Class Problem 1. (10 Points) James 6.1 Problem 2. (10 Points) James 6.3 Problem 3. (10 Points) James 6.5 Problem 4. (15 Points) James 6.7 Problem 5. (15 Points) James 6.10 Homework 4 Statistics W4240: Data Mining

More information

STAT10020: Exploratory Data Analysis

STAT10020: Exploratory Data Analysis STAT10020: Exploratory Data Analysis Statistical Programming with R Lab 1 Log in using your student number and password. Click on the start button and then click on Application Window. Choose the Mathematics

More information

Microsoft Access 3: Understanding and Creating Queries

Microsoft Access 3: Understanding and Creating Queries Microsoft Access 3: Understanding and Creating Queries In Access Level 2, we learned how to perform basic data retrievals by using Search & Replace functions and Sort & Filter functions. For more complex

More information

Agenda2. User Manual. Agenda2 User Manual Copyright 2010-2013 Bobsoft 1 of 34

Agenda2. User Manual. Agenda2 User Manual Copyright 2010-2013 Bobsoft 1 of 34 Agenda2 User Manual Agenda2 User Manual Copyright 2010-2013 Bobsoft 1 of 34 Agenda2 User Manual Copyright 2010-2013 Bobsoft 2 of 34 Contents 1. User Interface! 5 2. Quick Start! 6 3. Creating an agenda!

More information

Access Tutorial 8: Combo Box Controls

Access Tutorial 8: Combo Box Controls Access Tutorial 8: Combo Box Controls 8.1 Introduction: What is a combo box? So far, the only kind of control you have used on your forms has been the text box. However, Access provides other controls

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

COGNOS Query Studio Ad Hoc Reporting

COGNOS Query Studio Ad Hoc Reporting COGNOS Query Studio Ad Hoc Reporting Copyright 2008, the California Institute of Technology. All rights reserved. This documentation contains proprietary information of the California Institute of Technology

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

Business Objects Version 5 : Introduction

Business Objects Version 5 : Introduction Business Objects Version 5 : Introduction Page 1 TABLE OF CONTENTS Introduction About Business Objects Changing Your Password Retrieving Pre-Defined Reports Formatting Your Report Using the Slice and Dice

More information

SAS Analyst for Windows Tutorial

SAS Analyst for Windows Tutorial Updated: August 2012 Table of Contents Section 1: Introduction... 3 1.1 About this Document... 3 1.2 Introduction to Version 8 of SAS... 3 Section 2: An Overview of SAS V.8 for Windows... 3 2.1 Navigating

More information

Introduction to the data.table package in R

Introduction to the data.table package in R Introduction to the data.table package in R Revised: September 18, 2015 (A later revision may be available on the homepage) Introduction This vignette is aimed at those who are already familiar with creating

More information

Tips and Tricks SAGE ACCPAC INTELLIGENCE

Tips and Tricks SAGE ACCPAC INTELLIGENCE Tips and Tricks SAGE ACCPAC INTELLIGENCE 1 Table of Contents Auto e-mailing reports... 4 Automatically Running Macros... 7 Creating new Macros from Excel... 8 Compact Metadata Functionality... 9 Copying,

More information

Creating Custom Crystal Reports Tutorial

Creating Custom Crystal Reports Tutorial Creating Custom Crystal Reports Tutorial 020812 2012 Blackbaud, Inc. This publication, or any part thereof, may not be reproduced or transmitted in any form or by any means, electronic, or mechanical,

More information

Preface of Excel Guide

Preface of Excel Guide Preface of Excel Guide The use of spreadsheets in a course designed primarily for business and social science majors can enhance the understanding of the underlying mathematical concepts. In addition,

More information

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ

Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ PharmaSUG 2014 PO10 Switching from PC SAS to SAS Enterprise Guide Zhengxin (Cindy) Yang, inventiv Health Clinical, Princeton, NJ ABSTRACT As more and more organizations adapt to the SAS Enterprise Guide,

More information

Scientific Graphing in Excel 2010

Scientific Graphing in Excel 2010 Scientific Graphing in Excel 2010 When you start Excel, you will see the screen below. Various parts of the display are labelled in red, with arrows, to define the terms used in the remainder of this overview.

More information

Creating and Using Databases with Microsoft Access

Creating and Using Databases with Microsoft Access CHAPTER A Creating and Using Databases with Microsoft Access In this chapter, you will Use Access to explore a simple database Design and create a new database Create and use forms Create and use queries

More information

Gestation Period as a function of Lifespan

Gestation Period as a function of Lifespan This document will show a number of tricks that can be done in Minitab to make attractive graphs. We work first with the file X:\SOR\24\M\ANIMALS.MTP. This first picture was obtained through Graph Plot.

More information

ABSTRACT INTRODUCTION EXERCISE 1: EXPLORING THE USER INTERFACE GRAPH GALLERY

ABSTRACT INTRODUCTION EXERCISE 1: EXPLORING THE USER INTERFACE GRAPH GALLERY Statistical Graphics for Clinical Research Using ODS Graphics Designer Wei Cheng, Isis Pharmaceuticals, Inc., Carlsbad, CA Sanjay Matange, SAS Institute, Cary, NC ABSTRACT Statistical graphics play an

More information

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy

Using Excel for Analyzing Survey Questionnaires Jennifer Leahy University of Wisconsin-Extension Cooperative Extension Madison, Wisconsin PD &E Program Development & Evaluation Using Excel for Analyzing Survey Questionnaires Jennifer Leahy G3658-14 Introduction You

More information

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol

R Graphics Cookbook. Chang O'REILLY. Winston. Tokyo. Beijing Cambridge. Farnham Koln Sebastopol R Graphics Cookbook Winston Chang Beijing Cambridge Farnham Koln Sebastopol O'REILLY Tokyo Table of Contents Preface ix 1. R Basics 1 1.1. Installing a Package 1 1.2. Loading a Package 2 1.3. Loading a

More information

Basic Excel Handbook

Basic Excel Handbook 2 5 2 7 1 1 0 4 3 9 8 1 Basic Excel Handbook Version 3.6 May 6, 2008 Contents Contents... 1 Part I: Background Information...3 About This Handbook... 4 Excel Terminology... 5 Excel Terminology (cont.)...

More information

How Does My TI-84 Do That

How Does My TI-84 Do That How Does My TI-84 Do That A guide to using the TI-84 for statistics Austin Peay State University Clarksville, Tennessee How Does My TI-84 Do That A guide to using the TI-84 for statistics Table of Contents

More information

Access 2003 Introduction to Queries

Access 2003 Introduction to Queries Access 2003 Introduction to Queries COPYRIGHT Copyright 1999 by EZ-REF Courseware, Laguna Beach, CA http://www.ezref.com/ All rights reserved. This publication, including the student manual, instructor's

More information

Beginner s Matlab Tutorial

Beginner s Matlab Tutorial Christopher Lum lum@u.washington.edu Introduction Beginner s Matlab Tutorial This document is designed to act as a tutorial for an individual who has had no prior experience with Matlab. For any questions

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

Oracle Data Miner (Extension of SQL Developer 4.0)

Oracle Data Miner (Extension of SQL Developer 4.0) An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining

More information

Utilizing Microsoft Access Forms and Reports

Utilizing Microsoft Access Forms and Reports Utilizing Microsoft Access Forms and Reports The 2014 SAIR Conference Workshop #3 October 4 th, 2014 Presented by: Nathan Pitts (Sr. Research Analyst The University of North Alabama) Molly Vaughn (Associate

More information

VISUAL ALGEBRA FOR COLLEGE STUDENTS. Laurie J. Burton Western Oregon University

VISUAL ALGEBRA FOR COLLEGE STUDENTS. Laurie J. Burton Western Oregon University VISUAL ALGEBRA FOR COLLEGE STUDENTS Laurie J. Burton Western Oregon University VISUAL ALGEBRA FOR COLLEGE STUDENTS TABLE OF CONTENTS Welcome and Introduction 1 Chapter 1: INTEGERS AND INTEGER OPERATIONS

More information

Excel 2010: Create your first spreadsheet

Excel 2010: Create your first spreadsheet Excel 2010: Create your first spreadsheet Goals: After completing this course you will be able to: Create a new spreadsheet. Add, subtract, multiply, and divide in a spreadsheet. Enter and format column

More information

Excel Guide for Finite Mathematics and Applied Calculus

Excel Guide for Finite Mathematics and Applied Calculus Excel Guide for Finite Mathematics and Applied Calculus Revathi Narasimhan Kean University A technology guide to accompany Mathematical Applications, 6 th Edition Applied Calculus, 2 nd Edition Calculus:

More information

Excel 2007: Basics Learning Guide

Excel 2007: Basics Learning Guide Excel 2007: Basics Learning Guide Exploring Excel At first glance, the new Excel 2007 interface may seem a bit unsettling, with fat bands called Ribbons replacing cascading text menus and task bars. This

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

Using an Access Database

Using an Access Database A Few Terms Using an Access Database These words are used often in Access so you will want to become familiar with them before using the program and this tutorial. A database is a collection of related

More information

Formulas, Functions and Charts

Formulas, Functions and Charts Formulas, Functions and Charts :: 167 8 Formulas, Functions and Charts 8.1 INTRODUCTION In this leson you can enter formula and functions and perform mathematical calcualtions. You will also be able to

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures

Visualizing Data. Contents. 1 Visualizing Data. Anthony Tanbakuchi Department of Mathematics Pima Community College. Introductory Statistics Lectures Introductory Statistics Lectures Visualizing Data Descriptive Statistics I Department of Mathematics Pima Community College Redistribution of this material is prohibited without written permission of the

More information

How To Use The Correlog With The Cpl Powerpoint Powerpoint Cpl.Org Powerpoint.Org (Powerpoint) Powerpoint (Powerplst) And Powerpoint 2 (Powerstation) (Powerpoints) (Operations

How To Use The Correlog With The Cpl Powerpoint Powerpoint Cpl.Org Powerpoint.Org (Powerpoint) Powerpoint (Powerplst) And Powerpoint 2 (Powerstation) (Powerpoints) (Operations orrelog SQL Table Monitor Adapter Users Manual http://www.correlog.com mailto:info@correlog.com CorreLog, SQL Table Monitor Users Manual Copyright 2008-2015, CorreLog, Inc. All rights reserved. No part

More information

R Language Fundamentals

R Language Fundamentals R Language Fundamentals Data Types and Basic Maniuplation Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Where did R come from? Overview Atomic Vectors Subsetting

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Importing and Exporting With SPSS for Windows 17 TUT 117

Importing and Exporting With SPSS for Windows 17 TUT 117 Information Systems Services Importing and Exporting With TUT 117 Version 2.0 (Nov 2009) Contents 1. Introduction... 3 1.1 Aim of this Document... 3 2. Importing Data from Other Sources... 3 2.1 Reading

More information

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable

More information

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011

Dataframes. Lecture 8. Nicholas Christian BIOST 2094 Spring 2011 Dataframes Lecture 8 Nicholas Christian BIOST 2094 Spring 2011 Outline 1. Importing and exporting data 2. Tools for preparing and cleaning datasets Sorting Duplicates First entry Merging Reshaping Missing

More information

Hypercosm. Studio. www.hypercosm.com

Hypercosm. Studio. www.hypercosm.com Hypercosm Studio www.hypercosm.com Hypercosm Studio Guide 3 Revision: November 2005 Copyright 2005 Hypercosm LLC All rights reserved. Hypercosm, OMAR, Hypercosm 3D Player, and Hypercosm Studio are trademarks

More information

Practical Example: Building Reports for Bugzilla

Practical Example: Building Reports for Bugzilla Practical Example: Building Reports for Bugzilla We have seen all the components of building reports with BIRT. By this time, we are now familiar with how to navigate the Eclipse BIRT Report Designer perspective,

More information

WEB TRADER USER MANUAL

WEB TRADER USER MANUAL WEB TRADER USER MANUAL Web Trader... 2 Getting Started... 4 Logging In... 5 The Workspace... 6 Main menu... 7 File... 7 Instruments... 8 View... 8 Quotes View... 9 Advanced View...11 Accounts View...11

More information

Webropol 2.0 Manual. Updated 5.7.2012

Webropol 2.0 Manual. Updated 5.7.2012 Webropol 2.0 Manual Updated 5.7.2012 Contents 1. GLOSSARY... 2 1.1. Question types... 2 1.2. Software Glossary... 3 1.3. Survey Glossary... 3 1.4. Reporting Glossary... 5 1.5. MyWebropol Glossary... 5

More information

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Types of Variables Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs Quantitative (numerical)variables: take numerical values for which arithmetic operations make sense (addition/averaging)

More information

Financial Econometrics MFE MATLAB Introduction. Kevin Sheppard University of Oxford

Financial Econometrics MFE MATLAB Introduction. Kevin Sheppard University of Oxford Financial Econometrics MFE MATLAB Introduction Kevin Sheppard University of Oxford October 21, 2013 2007-2013 Kevin Sheppard 2 Contents Introduction i 1 Getting Started 1 2 Basic Input and Operators 5

More information

Regression Clustering

Regression Clustering Chapter 449 Introduction This algorithm provides for clustering in the multiple regression setting in which you have a dependent variable Y and one or more independent variables, the X s. The algorithm

More information