Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model




Transcription:

Assumptions

Assumptions of linear models
- Apply to the response variable, within each group if the predictor is categorical
- Apply to the error terms from the linear model: check by analysing residuals
- Normality
- Homogeneity of variance
- Independence

Data exploration
- Describe the distribution of the data; transform if required and appropriate (logs, square or fourth root)
- Check assumptions of the analysis
- Evaluate fit of the model
- Find patterns in multivariate data

Boxplot
[Figure: boxplot of Length, labelled with smallest value, median, largest value and outliers; each half of the box holds 25% of values]
[Figure: example boxplots with panel titles Symmetrical, Equal variances, Skewed, Outliers, Unequal variances; histogram of count against limpet numbers per quadrat]
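The quantities a boxplot displays can be computed directly, which is a quick way to check symmetry and flag outliers before fitting a model. The sketch below is illustrative only: the function name `boxplot_summary`, the 1.5 x IQR whisker rule and the quadrat counts are assumptions, not taken from the slides.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Five-number summary plus the points a boxplot would flag as outliers.

    The 1.5 * IQR fence is a common convention (an assumption here).
    """
    data = sorted(values)
    # Quartiles by linear interpolation over the observed range.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }


# Hypothetical counts per quadrat, with one very large value.
counts = [1, 2, 2, 3, 3, 3, 4, 4, 5, 21]
summary = boxplot_summary(counts)
print(summary)
```

A box whose median sits well off-centre, or a long upper whisker with flagged points, suggests skewness and is a cue to consider a transformation.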

Scatterplots
- Plotting bivariate data: the value of two variables is recorded for each observation
- Each variable is plotted on one axis (X or Y)
- Symbols represent each observation
- Assess the relationship between two variables

Model residuals
- A residual is the difference between the observed and predicted value of the response variable
  - regression model: $y_i - \hat{y}_i$
  - ANOVA model: $y_{ij} - \bar{y}_i$
- Standardised (studentised) residuals
  - residual / SE of residuals
  - follow a t-distribution

Normality
- Y normally distributed at each value of X
  - boxplots of Y, separate for each group if appropriate, should be symmetrical; watch out for outliers and skewness
  - transformations of Y often help
  - regression and ANOVA tests are robust to this assumption

Homogeneity of variance
- Variance (spread) of Y should be constant for each value of $x_i$
  - skewed populations or outliers produce unequal variances
  - transformations that improve normality of Y will also usually make the variance of Y more constant

Plots of residuals in regression
[Figure: two panels of residuals (+ve / -ve) plotted against predicted $y_i$]

ANOVA checks
- Plot residuals (or variances) against group means
- Tests for equal variances: Bartlett's, Cochran's, Levene's tests
- ANOVA is reliable if group n's are equal and variances are not too different (ratio of largest to smallest variance no more than about 3:1)
[Figure: variance plotted against mean; residuals plotted against mean]
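The ANOVA checks above (residuals as deviations from group means, and the ratio of largest to smallest group variance) can be sketched in a few lines. The group names, the data and the helper names `group_residuals` and `variance_ratio` are hypothetical and chosen only to illustrate a case where spread grows with the mean.

```python
import statistics


def group_residuals(groups):
    """ANOVA-model residuals: each observation minus its own group mean."""
    residuals = {}
    for name, ys in groups.items():
        group_mean = statistics.mean(ys)
        residuals[name] = [y - group_mean for y in ys]
    return residuals


def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance across groups."""
    variances = {name: statistics.variance(ys) for name, ys in groups.items()}
    return max(variances.values()) / min(variances.values()), variances


# Hypothetical data: equal n per group, spread increasing with the mean.
groups = {
    "low": [4.0, 5.0, 6.0, 5.0],
    "mid": [8.0, 10.0, 12.0, 10.0],
    "high": [15.0, 20.0, 25.0, 20.0],
}

residuals = group_residuals(groups)
ratio, variances = variance_ratio(groups)
print(variances)
print(ratio)
```

Here the ratio is far above 3:1 and the variances rise with the group means, the pattern for which a log or power transformation of Y is usually tried first.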

Independence
- Values of Y are independent of each other
  - no replicate used more than once
  - observations independent within and between groups
  - watch out for data that are a time series on the same experimental or sampling units
  - should be considered at the design stage
- Repeated measures analyses are suitable for some non-independent designs

Linearity (regression)
- True population relationship between Y and X is linear
  - scatterplot of Y against X
  - watch out for asymptotic or exponential patterns
  - transformations of Y, or of Y and X, often help

Transformations
- Transform variables to a new scale, e.g. degrees Fahrenheit to degrees Celsius
- Statistical transformations are
  - non-linear (change the shape of the distribution)
  - monotonic (retain the rank order of values)
- If Y (and therefore the error terms) is skewed, a log or power transformation of Y
  - improves homogeneity of variance
  - can reduce the influence of outliers
- If the relationship is nonlinear: linearise by transformation of Y and/or X

Data transformations
- Common transformations for biological data
  - log, square root or fourth root for skewed continuous distributions
  - arcsin for proportions and percentages
- Transformed variables must make biological sense

Transformation issues
[Figure: mussel clumps]
- Zeros in skewed distributions: log(y + constant) or power transformation
- Power transformations: fourth root useful for abundance data with a large range
- Base (10 or natural, e) for log transformations makes no difference to the result
- Arcsin for percentages or proportions: little effect unless values are close to zero or 1
- Presentation of results: back-transformation of means and errors
- Generalised linear models: non-normal error distributions
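A minimal sketch of the transformations listed above, assuming a constant of 1 for log(y + constant) and the arcsine square-root form for proportions (the slides only say "arcsin"). Function names and the abundance values are hypothetical.

```python
import math
import statistics


def log_plus_constant(y, constant=1.0):
    """log10(y + constant); the constant lets zeros be transformed."""
    return math.log10(y + constant)


def back_transform_log(mean_log, constant=1.0):
    """Undo log10(y + constant) for a mean computed on the log scale."""
    return 10 ** mean_log - constant


def fourth_root(y):
    """Power transformation for abundance data with a large range."""
    return y ** 0.25


def arcsine_sqrt(p):
    """Arcsine square-root transform of a proportion in [0, 1], in radians."""
    return math.asin(math.sqrt(p))


# Hypothetical abundances including a zero.
abundance = [0, 9, 99]
logged = [log_plus_constant(v) for v in abundance]
back_mean = back_transform_log(statistics.mean(logged))

# Natural and base-10 logs differ only by a constant factor,
# so the choice of base does not change the analysis.
ratios = [math.log(v + 1.0) / math.log10(v + 1.0) for v in abundance if v > 0]

print(logged, back_mean, ratios)
```

Note that the back-transformed mean is not the arithmetic mean of the raw values; it is closer to the centre of a skewed distribution, which is why results are reported on the back-transformed scale with that caveat in mind.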

Other regression diagnostics
- Check assumptions
- Check fit of model
- Warn about influential observations and outliers

Anscombe (1973) data set
- $R^2 = 0.67$, $y = 3.0 + 0.5x$, $t = 4.24$, $P = 0.002$
[Figure: scatterplot panels of the Anscombe data sets, which share the same fitted regression]

Outliers
- Unusual sample values very different from the rest of the sample: detect using boxplots
- Sample values a long way from the fitted model: detect by analysing residuals from the fitted model
- Solutions
  - if impossible values, delete and adjust df
  - run the analysis twice, with outliers in and with outliers omitted
  - if the result changes: problems!

Influence
- Cook's D statistic
  - calculated for each observation
  - measures change in regression slope if the observation is omitted
  - observations with large D have large influence on the estimated slope, also large residual
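Cook's D and the "run the analysis twice" advice can be illustrated on the third Anscombe (1973) data set, where one point lies well off an otherwise straight line. This is a sketch for simple linear regression only, using the standard formula $D_i = \frac{e_i^2}{p \cdot MSE} \cdot \frac{h_i}{(1 - h_i)^2}$ with $p = 2$ parameters; the helper names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def cooks_distance(x, y):
    """Cook's D for each observation in a simple linear regression."""
    n = len(x)
    n_params = 2  # intercept and slope
    intercept, slope = fit_line(x, y)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - n_params)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_bar) ** 2 / sxx
        distances.append(e ** 2 / (n_params * mse) * leverage / (1 - leverage) ** 2)
    return distances


# Anscombe (1973), third data set.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

distances = cooks_distance(x, y)
worst = max(range(len(x)), key=lambda i: distances[i])

# Run the analysis twice: all observations, then the most influential omitted.
intercept_all, slope_all = fit_line(x, y)
x_trim = [xi for i, xi in enumerate(x) if i != worst]
y_trim = [yi for i, yi in enumerate(y) if i != worst]
intercept_trim, slope_trim = fit_line(x_trim, y_trim)

print(worst, distances[worst])
print(slope_all, slope_trim)
```

If the slope changes noticeably between the two fits, as it does here, the conclusions depend on a single observation and should be reported with that in mind.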

[Figure: scatterplot of Y against X with three labelled observations]
- Observation 1 is an X and Y outlier but not influential
- Observation 2 has a large residual: outlier
- Observation 3 is very influential (large Cook's D) and is also an outlier

Assumptions not met - regression
- Transformations useful
- Non-parametric tests
  - robust regression: LAD, ranks
  - randomisation tests: randomise observations or residuals
- Smoothing functions

Smoothers
- Nonparametric description of the relationship between Y and X, unconstrained by a specific model structure
- Useful exploratory technique
  - is a linear model appropriate?
  - are particular observations influential?
- Used in generalized additive modeling (GAM)

Smoothers
- Each observation is replaced by a value reflecting neighbouring observations: the mean, median, or predicted value of a regression model through neighbouring observations
- Window size determines the neighbouring observations: size of window (number of observations) is set by the smoothing parameter
- Adjacent windows overlap
  - resulting line is smooth
  - smoothness controlled by the smoothing parameter (size of windows)
- Any section of the line is robust to values in other windows

Types of smoothers
- Running (moving) means or averages: means or medians within each window
- Lo(w)ess: locally weighted regression scatterplot smoothing
  - observations within a window are weighted differently
  - observations are replaced by predicted values from a local regression line

Assumptions not met - ANOVA
- Robust if equal n
- Transformations useful
- Non-parametric tests
  - rank transform tests: Kruskal-Wallis for single factor designs; ranks inappropriate for testing interaction terms
  - randomisation tests: randomise observations or residuals
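The running means and running medians described under "Types of smoothers" can be sketched as follows. The function name `running_smoother`, the odd-window requirement and the choice to shorten windows at the two ends are assumptions made for this illustration; the input is assumed to be Y values already sorted by X.

```python
import statistics


def running_smoother(y, window=3, stat=statistics.median):
    """Replace each value by a summary of the values in a centred window.

    `window` is the number of observations per window (the smoothing
    parameter) and must be odd. Windows are truncated at the ends.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(y)):
        lo = max(0, i - half)
        hi = min(len(y), i + half + 1)
        smoothed.append(stat(y[lo:hi]))
    return smoothed


# Hypothetical response values, ordered by X, with one extreme value.
y = [1, 2, 30, 4, 5]
running_medians = running_smoother(y, window=3, stat=statistics.median)
running_means = running_smoother(y, window=3, stat=statistics.mean)
print(running_medians)
print(running_means)
```

Comparing the two outputs shows why medians are often preferred for exploration: the extreme value pulls the running means in three neighbouring windows, while the running medians stay close to the rest of the data.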

Generalized linear models
- Select a distribution for the response variable: Poisson, binomial, lognormal
- Logistic models: binary data
- Log-linear models: count data in contingency tables

Outliers
- Observations further from the fitted model than the remaining observations
- Might be different from sample outliers in boxplots
[Figure: scatterplot with a labelled large residual (outlier)]
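To make the idea of selecting a Poisson distribution for the response concrete, the sketch below fits a Poisson model with a log link, $\log \mu_i = b_0 + b_1 x_i$, by Newton-Raphson (Fisher scoring). It is a single-predictor illustration with hypothetical counts, not the contingency-table log-linear model mentioned in the slides, and a statistics package would normally be used instead.

```python
import math


def fit_poisson_loglinear(x, y, max_iter=50, tol=1e-10):
    """Maximum-likelihood fit of log(mu) = b0 + b1 * x for Poisson counts."""
    b0 = math.log(sum(y) / len(y))  # start from the overall mean count
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix [[i00, i01], [i01, i11]].
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 ** 2
        # Newton step: solve the 2 x 2 system information * step = score.
        step0 = (i11 * g0 - i01 * g1) / det
        step1 = (i00 * g1 - i01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1


# Hypothetical counts increasing with a predictor.
x = [0, 1, 2, 3, 4, 5]
y = [1, 2, 2, 4, 6, 9]
b0, b1 = fit_poisson_loglinear(x, y)
fitted = [math.exp(b0 + b1 * xi) for xi in x]
print(b0, b1)
```

At the fitted values the residuals sum to zero, and the variance of the response is modelled as equal to the mean rather than constant, which is how a generalized linear model handles the non-normal errors that transformations otherwise try to remove.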