Lecture 4: Tools for data analysis, exploration, and transformation: plyr and reshape2
1 Lecture 4: Tools for data analysis, exploration, and transformation: plyr and reshape2. LSA 2013, Brain and Cognitive Sciences, University of Rochester. December 3, 2013
2 Data manipulation and exploration with plyr and reshape2
Today we'll look at two data manipulation tools which are flexible and powerful, but easy to use once you grasp a few concepts. First is plyr, which extends functional programming tools in R (like lapply) and makes the common data-analysis split-apply-combine procedure easy and elegant. Second is reshape2, which makes it easy to change the format of data frames and arrays between wide (observations spread across columns) and long (observations spread across rows) formats. Both are written by Hadley Wickham (like ggplot2). (You can download the knitr source for these slides on my website.)
3 split-apply-combine
freqs <- ddply(lexdec, .(Subject), function(df) coef(lm(RT ~ Frequency, data=df)))
## Subject (Intercept) Frequency
## 1 A1 ...
## 2 A2 ...
## 3 A3 ...
## 4 C ...
## 5 D ...
## 6 I ...
[Figure: RT against Frequency, one panel per subject (A1, A2, A3, C, D, I, J, K, M1, M2, P, R1, R2, R3, S, T1, T2, V, W1, W2, Z), with fitted per-subject regression lines.]
4 split-apply-combine
plyr is built around the conceptual structure of split-apply-combine:
split your data up in some way.
apply some function to each part.
combine the results into the output structure.
This is a common structure in many data analysis tasks, and R already has some facilities for it. plyr unifies these in a single interface and provides some nice helper functions, and also makes the split-apply-combine structure explicit. Before we get to plyr itself, let's have a short review of some basic functional programming concepts.
5 Functions: how do they work?
Formally: functions take some input, do something, and produce some output. You use functions in R all the time. Most of the functions you're familiar with have names and are built in (or provided by libraries).
mean(runif(100))
## [1] ...
But there's nothing special about functions in R; they're objects, just like any other data type. This means they can, for instance, be assigned to variables:
f <- mean
f(runif(100))
## [1] ...
6 (anonymous) functions: how do they work?
Functions are objects that are created with the function keyword:
function(x) sum(x)/length(x)
## function(x) sum(x)/length(x)
## <environment: 0x1044f3710>
Functions are by their nature anonymous in R, and have no name, in the same way that the vector c(1,2,3) is just an object, with no intrinsic name (this is unlike other languages, like Java or C). New functions can be assigned to variables to be called over and over again:
mymean <- function(x) sum(x)/length(x)
mymean(runif(100))
## [1] ...
mymean(runif(100, 1, 2))
## [1] ...
or just evaluated once:
(function(x) sum(x)/length(x))(runif(100))
## [1] ...
(notice the parentheses around the whole function definition)
7 Functions, environments, and closures
A function object lists an environment when it's printed to the console. This is because functions are really closures in R: they include information about the values of variables when the function was created. You can take advantage of this to make "function factories":
make.power <- function(n) {
  return(function(x) x^n)
}
my.square <- make.power(2)
my.square(3)
## [1] 9
(make.power(4))(2)
## [1] 16
See Hadley Wickham's excellent chapter on functional programming in R for more on this: Functional-programming
8 Functions: how do they work?
Function declarations have three parts:
1 The function keyword
2 A comma-separated list of function arguments
3 The body of the function, which is an expression (multi-statement expressions should be enclosed in braces {}). The value of the expression is used as the return value of the function if no return statement is encountered in the body.
For instance:
mean.and.var <- function(x) {
  m <- mean(x)
  v <- var(x)
  data.frame(mean=m, var=v)
}
9 a few function tips
The ellipsis ... can be included in the arguments list and captures any arguments not specifically named. This is useful to pass other arguments on to other function calls in the body (as we'll see later). You can specify default values for arguments with argument=default. R has very sophisticated argument resolution when a function is called: it first assigns named arguments by name, and then unnamed arguments are assigned positionally to unfilled arguments. So you can say something like
sd(rnorm(sd=5, mean=1, 100))
## [1] ...
where the last argument is interpreted as n, even though the specification of rnorm calls for n to be first:
rnorm
## function (n, mean = 0, sd = 1)
## .Internal(rnorm(n, mean, sd))
## <bytecode: ...>
## <environment: namespace:stats>
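As a quick sketch of the ... pass-through idea (trimmed.mean is a made-up wrapper for illustration, not from the slides):
trimmed.mean <- function(x, ...) mean(x, trim=0.1, ...)   # forwards extra args to mean
trimmed.mean(c(1:9, 1000))                    # the trim tames the outlier
trimmed.mean(c(1:9, 1000, NA), na.rm=TRUE)    # na.rm=TRUE is passed through to mean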
10 Your first foray into functional programming
R is a functional programming language at its heart. One of the most basic operations of functional programming is to apply a function individually to items in a list. In base R, this is done via lapply (for list-apply):
list.o.nums <- list(runif(100), rnorm(100), rpois(100, lambda=1))
lapply(list.o.nums, mean)
## [[1]]
## [1] ...
##
## [[2]]
## [1] ...
##
## [[3]]
## [1] 0.89
The big three in base R are lapply (takes and returns a list), sapply (like lapply but attempts to simplify output into a vector or matrix), and apply (which works on arrays).
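To make the contrast between the big three concrete, here's a minimal sketch on toy data:
lapply(list(a=1:3, b=4:6), mean)      # returns a list: $a = 2, $b = 5
sapply(list(a=1:3, b=4:6), mean)      # simplifies to a named vector: a = 2, b = 5
apply(matrix(1:6, nrow=2), 1, sum)    # applies over rows of an array: 9, 12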
11 unleash the power of anonymous functions
When combined with lapply and friends, anonymous functions are extremely powerful. You could, for instance, run a simulation with a range of parameter values:
sapply(1:10, function(n) rpois(5, lambda=n))
## (output: a 5-by-10 matrix, one column of five Poisson draws per lambda value)
Or repeat the same simulation multiple times, calculating a summary statistic for each repetition:
sapply(1:10, function(n) mean(rnorm(n=5, mean=0, sd=1)))
## (output: ten means, one per repetition)
12 split-apply-combine
You might also use sapply to calculate the mean RT (for instance) for each subject, by using split to create a list of each subject's RTs:
data(lexdec, package='languageR')
RT.bysub <- with(lexdec, split(RT, Subject))
RT.means.bysub <- sapply(RT.bysub, mean)
head(data.frame(RT.mean=RT.means.bysub))
## RT.mean
## A1 ...
## A2 ...
## A3 ...
## C ...
## D ...
## I 6.253
13 split-apply-combine
This is a common data-analysis task: split up the data in some way, analyze each piece, and then put the results back together again. The plyr package (Wickham, 2011) was designed to facilitate this process. For instance, instead of that split/sapply combo, we could use the ddply function:
library(plyr)
head(ddply(lexdec, .(Subject), function(df) data.frame(RT.mean=mean(df$RT))))
## Subject RT.mean
## 1 A1 ...
## 2 A2 ...
## 3 A3 ...
## 4 C ...
## 5 D ...
## 6 I 6.253
14 split-apply-combine
The ddply call has three parts:
ddply(lexdec, .(Subject), function(df) data.frame(RT.mean=mean(df$RT)))
1 The data, lexdec
2 The splitting variables, .(Subject). The .() function is a utility function which quotes a list of variables or expressions. We could just as easily have used the variable names as strings c("Subject") or (one-sided) formula notation ~Subject.
3 The function to apply to the individual pieces. In this case, the function takes a data.frame as input and returns a data.frame which has one variable, RT.mean. The splitting variables are automatically added before the results are combined.
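As a sketch, the three ways of naming the splitting variables are interchangeable:
ddply(lexdec, .(Subject),   function(df) data.frame(RT.mean=mean(df$RT)))
ddply(lexdec, c("Subject"), function(df) data.frame(RT.mean=mean(df$RT)))
ddply(lexdec, ~Subject,     function(df) data.frame(RT.mean=mean(df$RT)))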
15 plyr: input
Plyr commands are named based on their input and output. The first letter refers to the format of the input. The input determines how the data is split:
d*ply(.data, .variables, .fun, ...) takes data frame input and splits it into subsets based on the unique combinations of the .variables.
l*ply(.data, .fun, ...) takes list input, splitting the list and passing each element to .fun.
a*ply(.data, .margins, .fun, ...) takes array input, and splits it into sub-arrays by .margins (just like base R apply). For instance, if .margins = 1 and .data is a three-dimensional array, then .data[1, , ], .data[2, , ], ... are each passed to .fun.
16 plyr: output
The second letter of the command name refers to the output format. The output determines how the data is combined at the end:
*dply takes the result of its .fun and turns it into a data frame, then adds the splitting variables (values of .variables for ddply, list names for ldply, or array dimnames for adply) before rbinding the individual data frames together.
*lply just returns a list of the result of applying .fun to each individual split, just like lapply, but additionally adds names based on the splitting variables.
*aply tries to assemble the output of .fun into a big array, where the combined dimensions are the last ones. For instance, if .fun returns a two-dimensional array (always of the same size), and there were three splitting variables or dimensions originally, then the output would be a five-dimensional array, with dimensions 1 to 2 corresponding to the .fun output dimensions and 3 to 5 to the splitting variables.
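A minimal sketch of how the output letter changes the result, holding the split and function constant:
ddply(lexdec, .(Subject), function(df) mean(df$RT))   # data frame: Subject plus a value column
dlply(lexdec, .(Subject), function(df) mean(df$RT))   # list, one element per subject, named by subject
daply(lexdec, .(Subject), function(df) mean(df$RT))   # array (here, a named numeric vector)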
17 try it: output behavior of plyr commands
Task
Let's use the lexdec data set to explore the output behavior of plyr. Start with this ddply call (copy and paste from the slides pdf, or from the accompanying .R file):
library(plyr)
data(lexdec, package='languageR')  # load the dataset if it isn't already
ddply(lexdec, .(PrevType, Class), function(df) with(df, data.frame(meanRT=mean(RT))))
1 What does this do? Look at the lexdec data frame, run the command, and interpret the output.
2 Are these numbers really different? Change the function to also return the variance (or standard deviation or standard error, or whatever other measure you think might be useful).
3 Change it to return a list instead using dlply. The output might look a little funny. Why? Use str to investigate the output.
4 Now make it return an array using daply. The output will probably look totally wrong. Why? (Hint: use str to look at the output, again.) Fix it so that it does what you'd expect/like it to do.
18 subset
An added level of convenience comes from the fact that any extra arguments to, e.g., ddply are passed on to the function which operates on each piece. This means you can use functions like subset or transform which take a data frame and return another data frame. For instance, to find the trial with the slowest RT for each subject, split by Subject and then use subset:
slowesttrials <- ddply(lexdec, .(Subject), subset, RT==max(RT))
head(slowesttrials[, c('Subject', 'RT', 'Trial', 'Word')])
## Subject RT Trial Word
## 1 A1 ... tortoise
## 2 A2 ... lion
## 3 A3 ... radish
## 4 C ... frog
## 5 D ... chicken
## 6 I ... snake
This is equivalent to both
ddply(lexdec, .(Subject), function(df, ...) subset(df, ...), RT==max(RT))
ddply(lexdec, .(Subject), function(df) subset(df, RT==max(RT)))
19 transform
Another super convenient function is transform, which adds variables to a data frame (or replaces them) using expressions evaluated with the data frame as the environment (like the with function). For instance, we often standardize measures before regression (center and possibly scale). If the reaction time distributions of individual subjects are very different, then we might want to standardize them for each subject individually. In verbose ddply, we could do
ddply(lexdec, .(Subject), function(df) {
  df$RT.s <- scale(df$RT)
  return(df)
})
However, we can be more concise using transform:
lexdecScaledRT <- ddply(lexdec, .(Subject), transform, RT.s=scale(RT))
This expresses very transparently what we're trying to do: transform the data by adding a variable for the scaled (zero mean and unit sd) reaction time.
20 transform
[Figure: density plots of the raw RT and the per-subject scaled RT.s.]
21 summarise
plyr also provides the convenience function summarise (with an s!). This function, like transform, takes the form
summarise(.data, summvar1=expr1, summvar2=expr2, ...)
but unlike transform it creates a new data frame with only the specified summary variables.
22 summarise
For instance, to find the mean and variance of each subject's RT, we could use
lexdec.rtsumm <- ddply(lexdec, .(Subject), summarise, mean=mean(RT), var=var(RT))
head(lexdec.rtsumm)
## Subject mean var
## 1 A1 ...
## 2 A2 ...
## 3 A3 ...
## 4 C ...
## 5 D ...
## 6 I ...
This is more concise than the similar example a few slides ago,
ddply(lexdec, .(Subject), function(df) with(df, data.frame(meanRT=mean(RT))))
and, like with transform, makes our intentions much clearer.
23 transform+summarise exercises
Task
Let's investigate the relative ordering of RTs for different words.
1 Add a new variable RTrank which is the rank-order of the RT for each trial, by subject. That is, RTrank=1 for that subject's fastest trial, 2 for the second-fastest, etc. Hint: rank finds the rank indices of a vector.
2 Find the average RT rank for each word, using summarise.
3 Plot the relationship between the word frequencies and their average rank.
4 If you're feeling fancy, put errorbars on the words showing the 25% and 75% quantiles.
5 Which word has the highest average RT rank? The lowest?
24 transform+summarise solution
lexdec <- ddply(lexdec, .(Subject), transform, RTrank=rank(RT))
word.rt.ranks <- ddply(lexdec, .(Word, Frequency), summarise,
                       RTrank25=quantile(RTrank, 0.25),
                       RTrank75=quantile(RTrank, 0.75),
                       RTrank=mean(RTrank))
ggplot(word.rt.ranks, aes(x=Frequency, y=RTrank, ymin=RTrank25, ymax=RTrank75)) +
  geom_pointrange()
[Figure: mean RT rank (with 25%-75% quantile ranges) against word Frequency.]
subset(word.rt.ranks, RTrank %in% c(max(RTrank), min(RTrank)))
## Word Frequency RTrank RTrank25 RTrank75
## 3 apple ...
## 75 vulture ...
25 plyr example use cases
Let's go through some examples of how you might use plyr for data analysis and exploration:
Checking the distribution of errors.
Estimating the frequency effect for each subject.
Data exploration and errorbars.
Exploring models through simulation.
26 Checking the distribution of errors
Let's check to see whether the errors in lexdec responses are evenly distributed across native and non-native speakers. There are a couple of ways to do this. We could use ddply and summarise like above:
ddply(lexdec, .(NativeLanguage), summarise, acc=mean(Correct=='correct'))
## NativeLanguage acc
## 1 English ...
## 2 Other ...
We could also use daply to get an array of raw counts of correct and incorrect responses, by splitting on NativeLanguage and Correct and then extracting the number of rows in each split:
daply(lexdec, .(NativeLanguage, Correct), nrow)
## Correct
## NativeLanguage correct incorrect
## English ...
## Other ...
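For a simple cross-tabulation like this, base R's table gives the same counts, which makes a reasonable sanity check on the daply version:
with(lexdec, table(NativeLanguage, Correct))   # counts by language and correctness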
27 Checking the distribution of errors
Why might we want the latter option? It's the format that chisq.test expects:
correctcounts <- daply(lexdec, .(NativeLanguage, Correct), nrow)
chisq.test(correctcounts)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: correctcounts
## X-squared = 4.884, df = 1, p-value = 0.0271
28 Estimate the frequency effect for each subject
Let's estimate the effect of frequency on RT for each subject separately (perhaps to get a sense of whether to include random slopes in a mixed effects model). We can do this using a combination of ddply and coef:
subjectslopes <- ddply(lexdec, .(Subject), function(df) coef(lm(RT ~ Frequency, data=df)))
We can see that these slopes show a fair amount of variability,
summary(subjectslopes$Frequency)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## ...
so it might make sense to include random slopes in later regression modeling.
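If you want more than just the coefficients, one variation on this idea (a sketch, not from the slides) is to keep the whole fitted models around with dlply, then pull out pieces later with ldply:
subject.fits <- dlply(lexdec, .(Subject), function(df) lm(RT ~ Frequency, data=df))
# e.g., extract each subject's R-squared; the list names become an .id column
ldply(subject.fits, function(fit) data.frame(r.squared=summary(fit)$r.squared))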
29 Estimate the frequency effect for each subject
As a sanity check, we can also plot the fitted regression lines against the original data points:
ggplot(lexdec, aes(x=Frequency, y=RT)) +
  geom_point() +
  facet_wrap(~Subject) +
  geom_abline(data=subjectslopes, aes(slope=Frequency, intercept=`(Intercept)`), color='red')
[Figure: RT against Frequency, one panel per subject, with each subject's fitted regression line in red.]
30 data exploration and errorbars
Investigate the interaction between frequency and native language background. Let's first construct a factor variable which is a binary high frequency/low frequency variable:
lexdec <- transform(lexdec,
                    FreqHiLo=factor(ifelse(Frequency > median(Frequency), 'high', 'low'),
                                    levels=c('low', 'high')))
Then, use ddply and summarise to create a summary table for your conditions of interest:
se <- function(x) sd(x)/sqrt(length(x))
langfreq.summ <- ddply(lexdec, .(NativeLanguage, FreqHiLo), summarise,
                       mean=mean(RT), se=se(RT))
31 data exploration and errorbars
We can quickly plot condition means and 95% CIs using ggplot (etc.) and the ddply output:
ggplot(langfreq.summ, aes(x=FreqHiLo, color=NativeLanguage,
                          y=mean, ymin=mean-1.96*se, ymax=mean+1.96*se)) +
  geom_point() +
  geom_errorbar(width=0.1) +
  geom_line(aes(group=NativeLanguage))
[Figure: condition mean RTs with 95% CI errorbars, by FreqHiLo (low, high) and NativeLanguage (English, Other).]
32 Wow! What a huge effect!
...but wait. Calculating the standard error in this way assumes that, for the purposes of comparing the condition means, each observed RT is an independent draw from the same normal distribution. But in fact, they are not: observations are grouped both by subject and by item (word). One way of dealing with this: look at the by-subject standard error, by averaging within each subject and then treating each subject as an independent draw from the underlying condition. This is the by-subject analysis, like the F1 ANOVA.
33 by-subject standard errors with ddply
Computing the by-subject standard errors is a two-step process, each step of which can be done with a single ddply command:
1 Average within each subject and combination of conditions:
langfreq.bysub <- ddply(lexdec, .(NativeLanguage, FreqHiLo, Subject), summarise, RT=mean(RT))
2 Then, calculate the condition means and standard errors as before:
langfreq.bysub.summ <- ddply(langfreq.bysub, .(NativeLanguage, FreqHiLo), summarise,
                             mean=mean(RT), se=se(RT))
34 by-subject standard errors
When we plot the resulting error bars, the effects look much smaller compared to the variability across subjects (and the small number of subjects):
ggplot(langfreq.bysub.summ, aes(x=FreqHiLo, color=NativeLanguage,
                                y=mean, ymin=mean-1.96*se, ymax=mean+1.96*se)) +
  geom_point() +
  geom_errorbar(width=0.1) +
  geom_line(aes(group=NativeLanguage))
[Figure: by-subject condition means with 95% CI errorbars, by FreqHiLo and NativeLanguage.]
35 try it: by-item standard errors
Task
Use ddply to do the F2 or by-item analysis, finding by-item standard errors, treating the words as items. Plot the resulting errorbars, and use this (and the actual ddply output) to interpret the results.
36 by-item standard errors (solution)
langfreq.byitem <- ddply(lexdec, .(NativeLanguage, FreqHiLo, Word), summarise, RT=mean(RT))
langfreq.byitem.summ <- ddply(langfreq.byitem, .(NativeLanguage, FreqHiLo), summarise,
                              mean=mean(RT), se=se(RT))
## ggplot(langfreq.byitem.summ, aes(x=FreqHiLo, color=NativeLanguage,
##                                  y=mean, ymin=mean-2*se, ymax=mean+2*se)) +
##   geom_point() +
##   geom_errorbar(width=0.1) +
##   geom_line(aes(group=NativeLanguage))
37 A few other plyr tricks
The m*ply and r*ply functions are wrappers for other forms which make simulations more convenient. Check out the documentation (?mdply) and the plyr article in J. Stat. Software.
Because they combine things in nice ways, plyr functions can help R data structures play nicely together. To concatenate a list of data frames into one big data frame, you can use ldply(list.of.dfs, I) (the identity function I just returns its input); a small sketch follows below. To shatter an array into a data frame where the dimension names are stored in columns, you can use adply(an.array, 1:length(dim(an.array)), I). Any margins left out then index rows. If any dimensions are named, they will be transferred to the data frame in a smart way.
Use subset and ddply to remove outliers subject-by-subject.
Check balance (number of trials/subjects in each condition) using nrow and ddply or daply (like the correct/incorrect example).
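A minimal sketch of the ldply concatenation trick (toy data):
dfs <- list(a=data.frame(x=1:2), b=data.frame(x=3:4))
ldply(dfs, I)   # one big data frame, with the list names in an .id column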
38 Simulating data sets and model exploration
[Figure: preview of the simulation results below — density of t values for the (Intercept) and freqlow predictors, by model (lm, rand.int, rand.slope).]
39 Simulating data sets and model exploration
The best way to understand a model is to simulate fake data and see what the model does with it. Frequently the process goes as follows:
1 Pick some range of parameter values (random effect variance vs. residual variance).
2 Generate some data using those parameters.
3 Fit the model to that data, and record summary statistics.
This fits well within the split-apply-combine pattern of plyr. For example, let's look at how not accounting for random slopes and intercepts inflates Type I error rates. We'll generate fake data for a binary frequency variable which has a true effect of 0, then fit lm and lmer models with random intercepts and slopes. Let's start simple, fitting the models to one set of parameters, repeating the simulation 100 times.
40 step 1: simulate data
library(plyr)
library(mvtnorm)
library(lme4)
make.data.generator <- function(true.effects=c(0,0),
                                resid.var=1,
                                ranef.var=diag(c(1,1)),
                                n.subj=24,
                                n.obs=24) {
  # create design matrix for our made-up experiment
  data.str <- data.frame(freq=factor(c(rep('high', n.obs/2), rep('low', n.obs/2))))
  contrasts(data.str$freq) <- contr.sum(2)
  model.mat <- model.matrix(~ 1 + freq, data.str)
  generate.data <- function() {
    # sample a data set under a mixed effects model with random slopes/intercepts
    simulated.data <- rdply(n.subj, {
      beta <- t(rmvnorm(n=1, sigma=ranef.var)) + true.effects
      expected.RT <- model.mat %*% beta
      epsilon <- rnorm(n=length(expected.RT), mean=0, sd=sqrt(resid.var))
      data.frame(data.str, RT=expected.RT + epsilon)
    })
    names(simulated.data)[1] <- 'subject'
    simulated.data
  }
}
41 step 2: fit model
fit.models <- function(simulated.data) {
  # fit models and extract coefs
  lm.coefs <- coefficients(summary(lm(RT ~ 1+freq, simulated.data)))[, 1:3]
  rand.int.coefs <- summary(lmer(RT ~ 1+freq + (1|subject), simulated.data))@coefs
  rand.slope.coefs <- summary(lmer(RT ~ 1+freq + (1+freq|subject), simulated.data))@coefs
  # format output all pretty
  rbind(data.frame(model='lm', predictor=rownames(lm.coefs), lm.coefs),
        data.frame(model='rand.int', predictor=rownames(rand.int.coefs), rand.int.coefs),
        data.frame(model='rand.slope', predictor=rownames(rand.slope.coefs), rand.slope.coefs))
}
42 step 3: put it together + repeat
gen.dat <- make.data.generator()
simulations <- rdply(.n=100, fit.models(gen.dat()), .progress='text')
head(simulations)
## .n model predictor Estimate Std..Error t.value
## 1 1 lm (Intercept) ...
## 2 1 lm freqlow ...
## 3 1 rand.int (Intercept) ...
## 4 1 rand.int freqlow ...
## 5 1 rand.slope (Intercept) ...
## 6 1 rand.slope freqlow ...
daply(simulations, .(model, predictor), function(df) c(type1err=mean(abs(df$t.value) > 1.96)))
## predictor
## model (Intercept) freqlow
## lm ...
## rand.int ...
## rand.slope ...
43 step 4: visualize
# use reshape2::melt to get the data into a more convenient format (see next section)
ggplot(simulations, aes(x=t.value, color=model)) +
  geom_vline(xintercept=c(-1.96, 1.96), color='#888888', linetype=3) +
  scale_x_continuous('t value') +
  geom_density() +
  facet_grid(predictor~.)
[Figure: density of t values by model (lm, rand.int, rand.slope), faceted by predictor ((Intercept), freqlow), with dotted lines at ±1.96.]
44 different parameter values
What if we want to run the simulation with different sets of parameter values? Create a data frame of parameters, using expand.grid on arguments which have the same names as the arguments to make.data.generator:
head(params <- expand.grid(n.obs=c(4, 16, 64), n.subj=c(4, 16, 64)))
## n.obs n.subj
## 1 4 4
## 2 16 4
## 3 64 4
## 4 4 16
## 5 16 16
## 6 64 16
And then use mdply on the result:
man.simulations <- mdply(params, function(...) {
  make.data <- make.data.generator(...)
  rdply(.n=100, fit.models(make.data()))
}, .progress='text')
45 digression: mdply
mdply(params, function(...) {
  make.data <- make.data.generator(...)
  rdply(.n=100, fit.models(make.data()))
}, .progress='text')
mdply is like Map: it passes the variables in a data frame (split row-by-row) as named arguments to the function. The function(...) {} syntax means that the function will accept any named arguments, and then recycle them wherever the ... occurs anywhere inside the body. Thus, this mdply will pass the columns of params as arguments to the make.data.generator() function, no matter which parameters you specify. This specific example (where the parameters are n.obs and n.subj) is equivalent to:
ddply(params, .(n.obs, n.subj), function(df) {
  make.data <- make.data.generator(n.obs=df$n.obs, n.subj=df$n.subj)
  rdply(.n=100, fit.models(make.data()))
}, .progress='text')
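A toy sketch of mdply on its own (not from the slides; rnorm is just a stand-in function): each row of the parameter frame becomes one call, with columns matched to argument names, and extra arguments like n=3 passed along unchanged:
mdply(data.frame(mean=c(0, 10), sd=c(1, 5)), rnorm, n=3)
# returns a data frame with the mean and sd columns plus three sampled values per row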
46 reshape2
(If we have time): talk about changing the format of data using melt and cast from the reshape2 package.
47 What does your data look like?
Data doesn't always look like tools like ggplot (or ddply) expect it to. What if your experiment spits out a data file where each trial is a different column? The lexical decision data set might look like this:
## Subject 23_RT 23_Correct 24_RT 24_Correct 25_RT 25_Correct
## 1 A1 ... correct <NA> <NA> <NA> <NA>
## 2 A2 ... correct <NA> <NA> ... correct
## 3 A3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 C ... correct <NA> <NA> <NA> <NA>
## 5 D <NA> <NA> ... correct <NA> <NA>
## 6 I <NA> <NA> ... correct <NA> <NA>
## 7 J ... correct <NA> <NA> <NA> <NA>
(there are missing values because non-word trials are excluded)
48 Wide vs. long data
There are two ways of structuring data:
long: Each observation gets exactly one row, with values in id columns giving identifying information (like subject, trial, whether the observed value is a correct/incorrect response or an RT observation, etc.)
## Subject Trial variable value
## 1 A1 23 RT ...
## 2 A1 27 RT ...
## 3 A1 29 RT ...
## 4 A1 30 RT ...
## 5 A1 32 RT ...
wide: Each row contains all the observations for a unique combination of identifying variables (say, one subject). Column names identify the kind of observation in that row (trial number, observation type).
## Subject 23_RT 23_Correct 24_RT
## 1 A1 ... correct <NA>
## 2 A2 ... correct <NA>
## 3 A3 <NA> <NA> <NA>
## 4 C ... correct <NA>
## 5 D <NA> <NA> ...
49 reshaping data
Converting between wide and long data representations is a common task in data analysis (especially data import/cleaning). The reshape2 package streamlines this process in R. (Most of the functionality of reshape2 is a special case of what plyr does.)
50 melt and cast
Two main functions in reshape2:
melt converts an array or data frame into a long format.
dcast and acast convert molten data into a range of different shapes, from long to wide.
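As a minimal round-trip sketch (toy data, not the lexdec set):
library(reshape2)
wide <- data.frame(id=c('a','b'), x=c(1,2), y=c(3,4))
long <- melt(wide, id.vars='id')   # long format: columns id, variable, value
dcast(long, id ~ variable)         # back to the original wide shape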
51 melt with you (I'll stop the world and)
Let's start with an example. Here's the wide data frame from above:
ld.wide[1:5, 1:7]
## Subject 23_RT 23_Correct 24_RT 24_Correct 25_RT 25_Correct
## 1 A1 ... correct <NA> <NA> <NA> <NA>
## 2 A2 ... correct <NA> <NA> ... correct
## 3 A3 <NA> <NA> <NA> <NA> <NA> <NA>
## 4 C ... correct <NA> <NA> <NA> <NA>
## 5 D <NA> <NA> ... correct <NA> <NA>
When you melt data, you have to specify which variables (columns) are id variables and which are measure variables:
head(ld.m <- melt(ld.wide, id.var='Subject', na.rm=T))
## Subject variable value
## 1 A1 23_RT ...
## 2 A2 23_RT ...
## 4 C 23_RT ...
## 7 J 23_RT ...
## 8 K 23_RT ...
## 10 M2 23_RT ...
52 melt
Now use str_split (from the stringr package) to separate the trial number and measure information:
require(stringr)
trials.and.vars <- ldply(str_split(ld.m$variable, '_'))
names(trials.and.vars) <- c('Trial', 'measure')
str_split returns a list of splits, but we can use ldply to convert it to a data frame, to which we add informative names. The extracted trial numbers and RT/correct indicators can then be combined with the melted data with cbind:
head(ld.m <- cbind(ld.m, trials.and.vars))
## Subject variable value Trial measure
## 1 A1 23_RT ... 23 RT
## 2 A2 23_RT ... 23 RT
## 4 C 23_RT ... 23 RT
## 7 J 23_RT ... 23 RT
## 8 K 23_RT ... 23 RT
## 10 M2 23_RT ... 23 RT
53 melt syntax
To specify the measure and id variables, use the measure.vars= and id.vars= arguments. You can specify them as indices (column numbers) or names.
melt will try to guess the id and measure variables if you don't specify them.
If you specify only measure vars, melt will treat the other variables as id variables (and vice-versa).
If you want some variables omitted, specify the measure and id variables that you want and the others will be dropped.
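A small sketch of those options on toy data:
wide <- data.frame(id=c('a','b'), x=c(1,2), y=c(3,4))
melt(wide, measure.vars=c('x','y'))         # 'id' is inferred to be the id variable
melt(wide, id.vars='id', measure.vars='x')  # 'y' is dropped entirely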
54 cast
melt gets your data into a raw material that can be easily converted to other, more useful formats. Molten data can be converted to different shapes using the *cast commands. dcast creates a data frame, and acast creates an array. Both commands take molten data and a formula which defines the new shape.
55 dcast
dcast takes a two-dimensional formula. The left-hand side tells which variables determine the rows, and the right-hand side the columns. Let's put RT and Correct in their own columns. The ld.m$measure variable indicates whether the ld.m$value is an RT or a Correct measure, so we put that variable on the right-hand side of the formula:
head(dcast(ld.m, Subject+Trial ~ measure))
## Subject Trial Correct RT
## 1 A1 100 correct ...
## 2 A1 102 correct ...
## 3 A1 106 correct ...
## 4 A1 108 correct ...
## 5 A1 109 correct ...
## 6 A1 111 correct ...
56 dcast
We can also use the shorthand ... to indicate all other (non-value) variables:
head(dcast(ld.m, ... ~ measure))
## Subject variable Trial Correct RT
## 1 A1 23_RT 23 <NA> ...
## 2 A1 23_Correct 23 correct <NA>
## 3 A1 27_RT 27 <NA> ...
## 4 A1 27_Correct 27 correct <NA>
## 5 A1 29_RT 29 <NA> ...
## 6 A1 29_Correct 29 correct <NA>
But this is no good here, because ld.m$variable also encodes information about measure, so we have to remove it first to be able to use ...:
ld.m$variable <- NULL
head(dcast(ld.m, ... ~ measure))
## Subject Trial Correct RT
## 1 A1 100 correct ...
## 2 A1 102 correct ...
## 3 A1 106 correct ...
## 4 A1 108 correct ...
## 5 A1 109 correct ...
## 6 A1 111 correct ...
57 cast syntax
Specify the shape of the data using a formula. For each combination of the values of variables on the left-hand side, there will be one row, and likewise for columns with the right-hand side.
For dcast, the resulting data frame will also have the left-hand variables as columns. Right-hand variables will have their values pasted together as column names for the other columns.
If you want higher-dimensional output, you can use acast, which creates an array (specify dimensions like dim1var1 ~ dim2var1 + dim2var2 ~ dim3var1).
If there is no variable called value, then cast will try to guess. You can override the default by specifying the value.var= argument.
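For instance, a sketch of acast on the molten lexical decision data from above (assuming ld.m still has Subject, Trial, and measure columns):
acast(ld.m, Subject ~ Trial ~ measure)   # a 3-d array: subject x trial x measure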
58 aggregation
If the formula results in more than one value in each cell, you need to specify an aggregating function (like in ddply) via the fun.aggregate= argument (you can abbreviate it to fun.agg=). The default is length, which tells you how many observations are in that cell:
head(dcast(ld.m, Subject ~ measure))
## Subject Correct RT
## 1 A1 ...
## 2 A2 ...
## 3 A3 ...
## 4 C ...
## 5 D ...
## 6 I ...
The function you specify must return a single value (more constrained than ddply).
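A self-contained sketch of aggregation with a different function (toy data, not from the slides):
df <- data.frame(g=c('a','a','b','b'), m=rep('x', 4), value=c(1, 3, 10, 20))
dcast(df, g ~ m, fun.aggregate=mean)   # two values per cell, averaged: a is 2, b is 15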
59 but can't I just do this in Excel?
You can! But it will be tedious and you will make mistakes. Using tools designed for these data-manipulation tasks makes you be explicit about the things you are doing to your data. And when you are done, you have a script which is a complete record of what you did (and, if you're using knitr, a nicely formatted report, too).
60 plyr and reshape2 (melt)
The reshape2 operations are conceptually related to the split-apply-combine logic of plyr.
Question: what's the plyr equivalent of the melt command we saw before?
melt(ld.wide, id.var='Subject', na.rm=T)
Remember what melt does: split the data by the id variables, and rearrange the measure variable columns so that they're in one value column, moving the column names into a new variable column.
ddply(ld.wide, .(Subject), function(df) {
  vars <- names(df)
  vals <- t(df)
  dimnames(vals) <- NULL
  return(subset(data.frame(variable=vars, value=vals),
                variable != 'Subject' & !is.na(value)))
})
61 plyr and reshape2 (dcast)
Question: how would you write the dcast command from before?
head(dcast(ld.m, Subject+Trial ~ measure))
head(ddply(ld.m, .(Subject, Trial), function(df) {
  res <- data.frame(t(df$value))
  names(res) <- df$measure
  return(res)
}))
## Subject Trial RT Correct
## 1 A1 ... correct
## 2 A1 ... correct
## 3 A1 ... correct
## 4 A1 ... correct
## 5 A1 ... correct
## 6 A1 ... correct
