
Assumptions

Assumptions of linear models
  Apply to the response variable within each group if the predictor is categorical
  Apply to the error terms from the linear model
    check by analysing residuals
  Normality
  Homogeneity of variance
  Independence

Data exploration
  Describe the distribution of the data
    transform if required and appropriate: logs, square or fourth root
  Check assumptions of the analysis
  Evaluate fit of the model
  Find patterns in multivariate data

Boxplot
  [Figure: boxplot of Length, labelled with smallest value, median, largest value, and the 25% of values in each half of the box]
  [Figure: boxplot panels: 1. Symmetrical, equal variances; 2. Skewed; 3. Outliers; 4. Unequal variances]
  [Figure: Count against Limpet numbers per quadrat, with outliers marked]
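The quantities a boxplot displays can be computed directly, which helps when screening a variable for skewness and outliers. The sketch below is only an illustration: the limpet counts are made-up data, the function name is hypothetical, and the 1.5 x IQR fence for flagging outliers is Tukey's common convention, assumed here because the slides do not say which rule their boxplots use.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Return the five-number summary and the values flagged as outliers.

    Assumption: a value is an outlier if it lies more than
    `whisker` * IQR below the first or above the third quartile.
    """
    data = sorted(values)
    # Quartiles by linear interpolation that includes both end points.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }


# Hypothetical limpet counts per quadrat (illustration only).
limpets = [1, 2, 2, 3, 3, 3, 4, 4, 5, 19]
summary = boxplot_summary(limpets)
print(summary)
```

A median sitting well off-centre between the quartiles suggests skewness; values listed under "outliers" are candidates for checking, not for automatic deletion.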

Scatterplots
  Plotting bivariate data
    value of two variables recorded for each observation
    each variable plotted on one axis (X or Y)
    symbols represent each observation
  Assess relationship between two variables

Model residuals
  Residual is the difference between the observed and predicted value of the response variable
    regression model: $y_i - \hat{y}_i$
    ANOVA model: $y_{ij} - \bar{y}_i$
  Standardised (studentised) residuals
    residual / SE of residuals
    follow a t-distribution

Normality
  Y normally distributed at each value of X:
    boxplots of Y (separate for each group if appropriate) should be symmetrical; watch out for outliers and skewness
    transformations of Y often help
    regression and ANOVA tests are robust to this assumption

Homogeneity of variance
  Variance (spread) of Y should be constant for each value of $x_i$:
    skewed populations or outliers produce unequal variances
    transformations that improve normality of Y will also usually make the variance of Y more constant

Plots of residuals in regression
  [Figure: two panels of residuals (+ve and -ve) plotted against x or predicted $\hat{y}_i$]

ANOVA checks
  Plot residuals (or variances) against group means
  [Figure: variance against mean, and residuals against mean]
  Tests for equal variances
    Bartlett's, Cochran's, Levene's tests
  ANOVA reliable if group n's are equal and variances not too different:
    ratio of largest to smallest variance no more than 3:1
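The residual definitions and the variance-ratio rule of thumb above can be sketched in a few lines. This is a minimal illustration, not a replacement for a statistics package: the function names and the data are hypothetical, and the regression is an ordinary least-squares line with a single predictor.

```python
import statistics


def regression_residuals(x, y):
    """Residuals y_i - yhat_i from a least-squares straight line."""
    xbar, ybar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def anova_residuals(groups):
    """Residuals y_ij - ybar_i: each value minus its own group mean."""
    return {
        name: [v - statistics.fmean(vals) for v in vals]
        for name, vals in groups.items()
    }


def variance_ratio(groups):
    """Ratio of the largest to the smallest sample variance among groups."""
    variances = [statistics.variance(vals) for vals in groups.values()]
    return max(variances) / min(variances)


# Made-up data for illustration only.
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
reg_res = regression_residuals(x, y)

groups = {"low": [4, 5, 6, 5], "high": [8, 12, 10, 14]}
grp_res = anova_residuals(groups)
ratio = variance_ratio(groups)
print(reg_res, grp_res, ratio)
```

In this made-up example the ratio is well above 3:1 and the group with the larger mean has the larger variance, the pattern that a log or power transformation of Y often reduces.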

Independence
  Values of Y are independent of each other:
    no replicate used more than once
    observations independent within and between groups
    watch out for data which are a time series on the same experimental or sampling units
    should be considered at the design stage
  Repeated measures analyses are suitable for some non-independent designs

Linearity (regression)
  True population relationship between Y and X is linear:
    scatterplot of Y against X
    watch out for asymptotic or exponential patterns
    transformations of Y, or of Y and X, often help

Transformations
  Transform variables to a new scale
    e.g. degrees Fahrenheit to degrees Celsius
  Statistical transformations
    non-linear (change the shape of the distribution)
    monotonic (retain the rank order of values)
  If Y (and therefore the error terms) is skewed, a log or power transformation of Y:
    improves homogeneity of variance
    can reduce the influence of outliers
  If the relationship is nonlinear:
    linearise by transformation of Y and/or X

Data transformations
  Common transformations for biological data
    log, square or 4th root for skewed continuous distributions
    arcsin for proportions and %
  Transformed variables must make biological sense

Transformation issues
  [Figure: Mussel clumps]
  Zeros in skewed distributions
    log(y + constant) or power transformation
  Power transformations
    4th root useful for abundance data with a large range
  Base (10 or natural, e) for log transformations
    makes no difference to the result
  Arcsin for % or proportions
    little effect unless values are close to zero or 1
  Presentation of results
    back-transformation of means and errors
  Generalised linear models
    non-normal error distributions
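The transformations listed above are one-liners in most languages. The sketch below uses hypothetical function names and made-up abundance data. Two choices are assumptions rather than anything stated in the slides: the constant added before taking logs defaults to 1, and the arcsin transformation is applied to the square root of the proportion, which is the usual form.

```python
import math
import statistics


def log_transform(y, constant=1.0, base=10):
    """log(y + constant); the constant allows zeros in the data."""
    return [math.log(v + constant) / math.log(base) for v in y]


def fourth_root(y):
    """Fourth-root (power 0.25) transformation for non-negative data."""
    return [v ** 0.25 for v in y]


def arcsine_sqrt(proportions):
    """Arcsin of the square root of each proportion (values in 0..1)."""
    return [math.asin(math.sqrt(p)) for p in proportions]


def back_transformed_mean(y, constant=1.0, base=10):
    """Mean on the log scale, converted back to the original scale."""
    mean_log = statistics.fmean(log_transform(y, constant, base))
    return base ** mean_log - constant


# Made-up abundance data, including a zero.
counts = [0, 9, 99]
logged = log_transform(counts)      # approximately [0, 1, 2]
rooted = fourth_root(counts)
bt_mean = back_transformed_mean(counts)
print(logged, rooted, bt_mean)

# Changing the base only rescales the values by a constant factor,
# which is why the choice of base does not affect test results.
natural = log_transform(counts, base=math.e)
scale = math.log(10)
```

Note that the back-transformed mean (about 9 here) is not the arithmetic mean of the raw data (36), so results reported after back-transformation should say so.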

Other regression diagnostics
  Check assumptions
  Check fit of model
  Warn about influential observations and outliers

Anscombe (1973) data set
  $r^2 = 0.67$, $y = 3.0 + 0.5x$, t = 4.24, P = 0.002
  [Figure: four scatterplots of the Anscombe data sets]

Outliers
  Unusual sample values
    very different from the rest of the sample
    detect using boxplots
  Sample values a long way from the fitted model
    detect by analysing residuals from the fitted model
  Solutions
    if impossible values, delete and adjust df
    run the analysis twice, with outliers in and with outliers omitted
    if the result changes: problems!

Influence
  Cook's D statistic:
    calculated for each observation
    measures change in regression slope if the observation is omitted
    observations with large D have a large influence on the estimated slope
    also large residual
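Cook's D can be computed for a straight-line regression without refitting the model once per observation. The sketch below uses the standard closed form, D_i = e_i^2 / (p * MSE) * h_i / (1 - h_i)^2, with p = 2 estimated parameters and leverage h_i = 1/n + (x_i - xbar)^2 / Sxx. The data are the third of Anscombe's (1973) four data sets, chosen here as an illustration because it contains a single point far from an otherwise near-perfect line; the function name is hypothetical.

```python
def cooks_distance(x, y):
    """Slope, intercept and Cook's D per observation for a simple
    least-squares regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # parameters estimated: intercept and slope
    mse = sum(e * e for e in residuals) / (n - p)
    leverage = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

    d = [
        (e * e) / (p * mse) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]
    return slope, intercept, d


# Anscombe (1973), data set 3.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

slope, intercept, d = cooks_distance(x, y)
most_influential = max(range(len(d)), key=lambda i: d[i])
print(f"slope={slope:.3f} intercept={intercept:.3f}")
print(f"largest D at index {most_influential}: x={x[most_influential]}")
```

The observation at x = 13 has by far the largest D, so it is the one to examine by running the analysis with and without it.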

Assumptions not met - regression
  [Figure: scatterplot of Y against X with three labelled observations]
    Observation 1 is an X and Y outlier but not influential
    Observation 2 has a large residual: an outlier
    Observation 3 is very influential (large Cook's D) and also an outlier
  Transformations useful
  Non-parametric tests
    robust regression: LAD, ranks
    randomisation tests: randomise observations or residuals
  Smoothing functions

Smoothers
  Nonparametric description of the relationship between Y and X
    unconstrained by a specific model structure
  Useful exploratory technique:
    is a linear model appropriate?
    are particular observations influential?
  Used in generalized additive modeling (GAM)

Smoothers
  Each observation replaced by a value reflecting neighbouring observations
    mean or median
    or predicted value of a regression model through neighbouring observations
  Window size determines the neighbouring observations
    size of window (number of observations) determined by the smoothing parameter
  Adjacent windows overlap
    resulting line is smooth
    smoothness controlled by the smoothing parameter (size of windows)
  Any section of the line is robust to values in other windows

Types of smoothers
  Running (moving) means or averages:
    means or medians within each window
  Lo(w)ess: locally weighted regression scatterplot smoothing
    observations within a window weighted differently
    observations replaced by predicted values from a local regression line

Assumptions not met - ANOVA
  Robust if equal n
  Transformations useful
  Non-parametric tests
    rank transform tests
      Kruskal-Wallis for single factor designs
      ranks inappropriate for testing interaction terms
    randomisation tests
      randomise observations or residuals
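The running means and medians described under "Types of smoothers" are the simplest smoothers to implement. The following sketch assumes the Y values are already ordered by X and that windows are simply truncated at the two ends of the series; both are choices made for this illustration, and the function name and data are hypothetical.

```python
import statistics


def running_smoother(y, window=3, stat=statistics.median):
    """Replace each value by `stat` of the values in a window centred on it.

    `y` must be ordered by the X variable. `window` is the number of
    observations per window (odd), i.e. the smoothing parameter.
    Assumption: windows are shortened where they run past either end.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    return [
        stat(y[max(0, i - half): i + half + 1])
        for i in range(len(y))
    ]


# Made-up series with one aberrant value (30).
y = [1, 2, 30, 4, 5, 6, 7]
smooth_median = running_smoother(y, window=3)
smooth_mean = running_smoother(y, window=3, stat=statistics.fmean)
print(smooth_median)
print(smooth_mean)
```

The running median ignores the aberrant value almost entirely, while the running mean is pulled towards it in every window that contains it; a larger window gives a smoother line in both cases.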

Generalized linear models
  Select a distribution for the response variable
    Poisson, binomial, lognormal
  Logistic models
    binary data
  Log-linear models
    count data in contingency tables

Outliers
  Observations further from the fitted model than the remaining observations
    might be different from sample outliers in boxplots
  Large residual: outlier