data visualization and regression

data visualization and regression
[Figure: Sepal.Length (4.5 to 8.0) plotted by Species, for I. setosa, I. versicolor, and I. virginica]

data visualization: pros and cons
"There is no statistical tool that is as powerful as a well-chosen graph." (William Cleveland)
dimensionality: 2-D, maybe 3-D (projected into 2-D)
the type of data we often work with makes visualization harder
univariate visualization is still a good tool
if assumptions are met, regression is very useful
we can use visualization to check that those assumptions are met

what is our kind of data?
sociolinguistic (response) variables are usually binary
predictor variables are often categorical (factors), in part because of limitations of the Rbrul software
usually 100s of observations from 10s of speakers
often interested in predictors on two levels:
  social or external: gender, age, social class, etc.
  linguistic or internal: phonological context, grammatical categories
traditionally analyzed with ordinary regression

what is regression? what's a model?
regression is descriptive stats: size of effects
regression is inferential stats: are effects > 0? are two categories equal? (p-values!)
demonstration using R
  always use a script
the most basic function is lm() for linear regression
simple linear regression: one predictor
  lm(y ~ x)
  plot(y ~ x)
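
A minimal sketch of that lm() demonstration, using simulated data (all names here are invented for illustration):

set.seed(1)                      # simulated data
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)    # a linear effect plus noise

m <- lm(y ~ x)     # simple linear regression: one predictor
summary(m)         # estimates, standard errors, p-values
plot(y ~ x)        # scatterplot of the raw data
abline(m)          # overlay the fitted regression line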

regression terminology
distinction between predictors of interest and control predictors
I prefer "response" and "predictors"
"errors" or "residuals"

regression assumptions
independence (of residuals)
linearity
normality (of residuals)
no omitted variable bias
logistic regression (with a binary response) has fewer assumptions

goodness of fit: R²
regression is an attempt to account for the variability in a data set
with linear regression, you can calculate how much of the variation has been accounted for
this is called R²; it ranges from 0 to 1
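
As a sketch, R² can be read directly off a fitted lm() object (simulated data again, not from the talk):

set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
m <- lm(y ~ x)
summary(m)$r.squared   # proportion of variance accounted for: between 0 and 1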

extensions of linear regression
GLM (generalized linear models)
  logistic regression: models the log-odds of the response, ln(p / (1 − p))
  Poisson regression: responses that are counts
  etc.
all these can be called fixed-effects models, meaning: not mixed-effects models

logistic regression
the general norm in quantitative sciences is linear regression with continuous predictors
in sociolinguistics, the norm is logistic regression with categorical predictors
in logistic regression, the predictors still have linear effects and combinations of effects
but the effect is not on the 0's and 1's directly, but on the log-odds: ln(p / (1 − p))
residuals work differently because the response is 0 or 1
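
A minimal sketch of a logistic regression in R, with a simulated binary response (names invented for illustration):

set.seed(2)
x <- rnorm(200)
p <- plogis(-1 + 1.5 * x)            # plogis() inverts the log-odds, ln(p / (1 - p))
y <- rbinom(200, 1, p)               # observed 0's and 1's
m <- glm(y ~ x, family = binomial)
coef(m)                              # effects are on the log-odds scale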

basics of R
what is R?
command-line interface, but don't use it directly: use scripts and execute one part at a time
how: we assign models to objects (give them names)
we can then examine the models, compare the models, find the best model
best data format: rows are observations, columns are variables
  easy in Excel: save as .csv, then in R, use read.csv()
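
A sketch of that workflow, assuming a hypothetical file mydata.csv exported from Excel:

d <- read.csv("mydata.csv")   # rows = observations, columns = variables
str(d)                        # check that each column has the right type
head(d)                       # inspect the first few rows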

basic fixed-effects regression in R
the function: lm(), glm(), lmer(), glmer(), others
the formula: y ~ x1 + x2 + …
the family: gaussian (linear), binomial (logistic), poisson (Poisson), others
the general pattern: > m1 <- function(formula, data = …, family = …), where "function" is one of the above
print methods: > print(m1) or just > m1
summary methods: > summary(m1)
anova methods: > anova(m1) or > anova(m1, m2)

mixed-effects models: why? what?
[Figure: multi-panel plot]

mixed-effects models: why? what?
[Figure: multi-panel plot]

why a different kind of model?
if we leave out the speaker (or similar) level, and there is any variation at that level:
  the independence assumption is violated
  omitted variable bias may be occurring
if we try to include the speaker (or similar): collinearity problem
  impossible to divide an effect between speaker and between-speaker variables

four ways fixed effects can fail
1) they overestimate the significance of between-speaker predictors
2) if speakers have different amounts of data, the size of between-speaker predictor effects can be wrong
3) if speakers have different balances of the other predictors, the size of within-speaker effects can be wrong
4) in logistic regression, a general shrinking of effects

how mixed effects do better
they account for the speaker (etc.) level by estimating the population variance of speakers
the inference (p-values) now reflects the real hierarchical structure of the data
they have the same familiar fixed-effects part

random-effect estimates
are not quite the same as fixed-effect estimates
are called BLUPs (best linear unbiased predictors) or conditional modes
they are not true parameters of the model; rather, the group variances are the parameters
but we can inspect the BLUPs as if they were part of the model

goodness of fit: a problem
one drawback to mixed models: no obvious analog of R²
harder to say how much has been explained
for example, if speakers are being controlled for, we can test whether e.g. age, sex, or class is significant
but the more those fixed effects explain, the less the speaker random effect explains

fitting mixed-effects models in R
> lm(y ~ 1 + x, data)
> glm(y ~ 1 + x, data, family = gaussian)
> glm(y ~ 1 + x, data, family = binomial)
> lmer(y ~ 1 + x + (1 | s), data)
> glmer(y ~ 1 + x + (1 | s), data, family = binomial)
> glmer(y ~ 1 + x + (1 + x | s), data, family = binomial)
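
These calls assume the lme4 package is loaded. A runnable sketch with simulated grouped data (the speaker factor s and predictor x are invented for illustration):

library(lme4)

set.seed(3)                                     # simulated data
s <- factor(rep(1:20, each = 30))               # 20 speakers, 30 obs each
x <- rnorm(600)
y <- 1 + 0.5 * x + rnorm(20)[s] + rnorm(600)    # speaker-level + obs-level noise

m <- lmer(y ~ 1 + x + (1 | s))   # random intercept for each speaker
summary(m)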

the formula: fixed-effects part
same as in a fixed-effects model! everything you did, you do the same way
ideally there is a parallel between the fixed- and random-effect specifications
"maximal" random-effect structure means: every term in the fixed effects has its place(s) in the random effects, and mostly vice versa

the formula: random-effects part
identify grouping factors (they go after the | symbol)
if more than one, they can be nested or crossed
the simplest random effects are random intercepts:
  ~ 1 + gender + (1 | speaker)   speaker is a group!
  ~ 1 + gender + (1 | speaker) + (1 | small.group)
  ~ 1 + gender + freq. + (1 | speaker) + (1 | word)
between-speaker variables need a speaker random intercept
between-word variables need a word random intercept

the formula: random-effects part
the intercept can usually vary between groups
if the effects might too, you need random slopes
  ~ 1 + gender + freq. + (1 | speaker) + (1 | word)
gender can't vary by speaker, freq. can't vary by word!
gender could vary by word, freq. could vary by speaker:
  ~ 1 + gender + freq. + (1 + freq. | speaker) + (1 + gender | word)
random slopes can cause slow/bad model fitting
  tip: center any continuous predictors (see the sketch below)
  tip: drop slopes for predictors not of interest
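
A sketch of the centering tip (the data frame and column names here are hypothetical):

d <- data.frame(freq = c(12, 3, 45, 7, 30))   # toy data
d$freq.c <- d$freq - mean(d$freq)             # centered copy of the predictor
# then use freq.c in the model, e.g. ... + (1 + freq.c | speaker)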

the formula: shorthand
1 means intercept and is optional: ~ 1 + x is the same as ~ x
0 means no intercept (rarely needed): ~ 0 + x
* is for interactions: ~ x1 * x2 is the same as ~ x1 + x2 + x1:x2
^ is for more than one interaction: ~ (x1 + x2 + x3)^2 equals ~ x1*x2 + x1*x3 + x2*x3
transformations: log(x), I(x^2), anything else!
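
One way to check how R expands this shorthand is to inspect the terms of a formula (a sketch; x1, x2, x3 are placeholders):

f <- y ~ (x1 + x2 + x3)^2
attr(terms(f), "term.labels")
# "x1" "x2" "x3" "x1:x2" "x1:x3" "x2:x3"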

categorical predictors: contrasts
the estimate for a continuous predictor is always: what is the change in y for a one-unit increase in x?
  y could be the response itself, the log-odds of it, etc.
for a categorical predictor with k levels, there are k − 1 coefficients to be estimated
binary (k = 2): one coefficient, easy: the difference between the two levels
if k > 2, several systems of contrasts are used:
  "treatment": levels compared to one baseline (coded 0)
  "sum": levels are deviations from the mean of all levels

more about contrasts
changing contrasts does not change the model, but it does affect the model output
with interactions, contrasts become complicated
can change the baseline with relevel()
in treatment contrasts, the missing level is 0; in sum contrasts, it is 0 minus the sum of the others
  treatment: (0), 1, 2
  sum: -1, 0, (1)
missing levels are frustrating; Rbrul shows all levels
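
A sketch of the two contrast codings and of relevel() on a toy 3-level factor:

contr.treatment(3)   # each level vs. a baseline; the baseline row is all 0s
contr.sum(3)         # each level vs. the mean of all; last level coded -1

f <- factor(c("a", "b", "c"))
f <- relevel(f, ref = "b")      # change the treatment-contrast baseline
contrasts(f) <- contr.sum(3)    # or switch the factor to sum contrasts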

working with mixed-effects models

anatomy of the (g)lmer output
> lmer(y ~ shape * color + (1 | speaker), d)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ shape * color + (1 | speaker)
   Data: d
REML criterion at convergence: 6.9096
Random effects:
 Groups   Name        Std.Dev.
 speaker  (Intercept) 0.81074
 Residual             0.05714
Number of obs: 16, groups: speaker, 8
Fixed Effects:
             (Intercept)            shape_triangle
                 2.99198                   2.01150
               color_red  shape_triangle:color_red
                 2.00804                  -0.04212

working w/ fixed-effects estimates
Fixed Effects:
             (Intercept)            shape_triangle
                 2.99198                   2.01150
               color_red  shape_triangle:color_red
                 2.00804                  -0.04212

working w/ random-effects estimates
Random effects:
 Groups   Name        Std.Dev.
 speaker  (Intercept) 0.81074
 Residual             0.05714

p-values from within a model
> summary(model)
Fixed effects:
                         Estimate Std. Error t value
(Intercept)               2.99198    0.40637    7.36
shape_triangle            2.01150    0.04041   49.78
color_red                 2.00804    0.57470    3.49
shape_triangle:color_red -0.04212    0.05714   -0.74
install.packages("lmerTest")
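
A sketch of the lmerTest route to these p-values, reusing the example model from the slide above (d, shape, color, speaker are that example's hypothetical data):

install.packages("lmerTest")   # once
library(lmerTest)              # provides an lmer() whose summary() adds
                               # Satterthwaite df and p-values
m <- lmer(y ~ shape * color + (1 | speaker), data = d)
summary(m)                     # now reports df and Pr(>|t|) per coefficient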

p-values from comparing models
> anova(m, mm)
Models:
mm: y ~ shape + color + (1 | speaker)
m: y ~ shape * color + (1 | speaker)
   Df    AIC    BIC logLik deviance  Chisq Chi Df Pr(>Chisq)
mm  5 7.9098 11.773 1.0451  -2.0902
m   6 9.2163 13.852 1.3918  -2.7837 0.6935      1      0.405
test entire predictors (or interactions)
test contrasts within a predictor, combining levels
test the random effects themselves
  some argue that this is not necessary
  larger questions over what belongs in a model
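
A sketch of that comparison; note that when two models differ in their fixed effects, anova() refits REML fits with maximum likelihood before comparing (same hypothetical data as above):

mm <- lmer(y ~ shape + color + (1 | speaker), data = d)   # without interaction
m  <- lmer(y ~ shape * color + (1 | speaker), data = d)   # with interaction
anova(mm, m)   # likelihood-ratio (Chisq) test of the interaction term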

more mixed-effects models in R
other R packages besides lme4:
  ordinal
  mgcv (GAM(M)s)
  MCMCglmm

mixed-effects models beyond R
SAS
JAGS/BUGS (Bayesian)
MLwiN
BayesX

visualizing mixed-effects models
lattice package ("trellis" plots)
effects package
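
Two sketches, assuming m is a (g)lmer fit like the ones above:

library(lattice)
dotplot(ranef(m, condVar = TRUE))   # "caterpillar" plot of the speaker BLUPs

library(effects)                    # install.packages("effects") if needed
plot(allEffects(m))                 # one partial-effect panel per predictor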

some books I can recommend

try Rbrul?
> source("http://www.danielezrajohnson.com/rbrul.r")
email support available at d.e.johnson@lancaster.ac.uk