How Far is too Far? Statistical Outlier Detection



Similar documents
A Review of Statistical Outlier Methods

Final Exam Practice Problem Answers

Using R for Linear Regression

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

2 Sample t-test (unequal sample sizes and unequal variances)

Lecture 1: Review and Exploratory Data Analysis (EDA)

Error Type, Power, Assumptions. Parametric Tests. Parametric vs. Nonparametric Tests

INTERPRETING THE ONE-WAY ANALYSIS OF VARIANCE (ANOVA)

Simple linear regression

Exercise 1.12 (Pg )

MTH 140 Statistics Videos

Session 7 Bivariate Data and Analysis

Hypothesis testing - Steps

Estimation of σ 2, the variance of ɛ

Example: Boats and Manatees

THE KRUSKAL WALLLIS TEST

Applying Statistics Recommended by Regulatory Documents

Multiple Linear Regression

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

HYPOTHESIS TESTING: POWER OF THE TEST

Analytical Methods: A Statistical Perspective on the ICH Q2A and Q2B Guidelines for Validation of Analytical Methods

Part 2: Analysis of Relationship Between Two Variables

Simple Regression Theory II 2010 Samuel L. Baker

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

BA 275 Review Problems - Week 6 (10/30/06-11/3/06) CD Lessons: 53, 54, 55, 56 Textbook: pp , ,

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

II. DISTRIBUTIONS distribution normal distribution. standard scores

2. Linear regression with multiple regressors

Confidence Intervals for the Difference Between Two Means

Geostatistics Exploratory Analysis

Chapter 13 Introduction to Linear Regression and Correlation Analysis

CALCULATIONS & STATISTICS

STAT 350 Practice Final Exam Solution (Spring 2015)

Chapter 8 Hypothesis Testing Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing

Descriptive Statistics

How To Check For Differences In The One Way Anova


Study Guide for the Final Exam

Premaster Statistics Tutorial 4 Full solutions

Chapter 7: Simple linear regression Learning Objectives

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

How To Test For Significance On A Data Set

Unit 31 A Hypothesis Test about Correlation and Slope in a Simple Linear Regression

1. The parameters to be estimated in the simple linear regression model Y=α+βx+ε ε~n(0,σ) are: a) α, β, σ b) α, β, ε c) a, b, s d) ε, 0, σ

Comparing Two Groups. Standard Error of ȳ 1 ȳ 2. Setting. Two Independent Samples

Descriptive Statistics

Chapter 7 Section 1 Homework Set A

Exploratory data analysis (Chapter 2) Fall 2011

Variables Control Charts

Chicago Booth BUSINESS STATISTICS Final Exam Fall 2011

MULTIPLE REGRESSION EXAMPLE

Elements of statistics (MATH0487-1)

3.4 Statistical inference for 2 populations based on two samples

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

12: Analysis of Variance. Introduction

Two Related Samples t Test

P(every one of the seven intervals covers the true mean yield at its location) = 3.

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

A Determination of g, the Acceleration Due to Gravity, from Newton's Laws of Motion

2013 MBA Jump Start Program. Statistics Module Part 3

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

EQUATIONS and INEQUALITIES

5. Linear Regression

Nominal and Real U.S. GDP

N-Way Analysis of Variance

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 23. Inferences for Regression

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

General Method: Difference of Means. 3. Calculate df: either Welch-Satterthwaite formula or simpler df = min(n 1, n 2 ) 1.

Standard Deviation Estimator

ICH Topic Q 2 (R1) Validation of Analytical Procedures: Text and Methodology. Step 5

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

CHI-SQUARE: TESTING FOR GOODNESS OF FIT

Statistical Models in R

2. Filling Data Gaps, Data validation & Descriptive Statistics

Experimental Design. Power and Sample Size Determination. Proportions. Proportions. Confidence Interval for p. The Binomial Test

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

Statistics 151 Practice Midterm 1 Mike Kowalski

4. Continuous Random Variables, the Pareto and Normal Distributions

List of Examples. Examples 319

Chapter 4 and 5 solutions

Simple Linear Regression Inference

Correlation and Simple Linear Regression

Homework 8 Solutions

AP * Statistics Review. Descriptive Statistics

22. HYPOTHESIS TESTING

UNDERSTANDING THE INDEPENDENT-SAMPLES t TEST

Lecture 2. Summarizing the Sample

AP Statistics Solutions to Packet 2

Section A. Index. Section A. Planning, Budgeting and Forecasting Section A.2 Forecasting techniques Page 1 of 11. EduPristine CMA - Part I

p ˆ (sample mean and sample

Tutorial 5: Hypothesis Testing

Module 5: Multiple Regression Analysis

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Mean = (sum of the values / the number of the value) if probabilities are equal

Transcription:

How Far is too Far? Statistical Outlier Detection Steven Walfish President, Statistical Outsourcing Services steven@statisticaloutsourcingservices.com 30-325-329

Outline What is an Outlier, and Why are they Important? Why Study Outliers? How to Handle Outliers? Boxplots Trimmed Means Confidence Intervals II. Outlier Identification The Extreme Studentized Deviate Test (ESD) Generalized ESD Dixon-Type Tests Shapiro-Wilks Test III. Regression Influential Observations Effect of Outliers on Slope, Intercept and R-Squared Residual Analysis IV. Comparing the Methods Which test to use when VI. Interactive Exercise Interactive exercise to review outlier methods. Determine how to handle outliers. Deciding when to retest, when to reject and when to do nothing. 2

What is an outlier? An outlier is an observation that appears to be inconsistent with other observations in the data set. An observation that has a low probability of being a member of the target distribution. A discordant observations that is brought to the attention of the investigator. 3

Why Study Outliers? Outliers provide useful information about the data. Outliers can come from a location shift or a scale shift in the data. Outliers can come from a gross recording error or measurement error. 4

Why Study Outliers? Sample time or sample frame errors (i.e. morning versus afternoon). Sometimes the value is plausible, but unexpected. 5

How to Handle Outliers How to record the data? Always keep and record the outlier observation. They provide valuable information for future study. Attempt to identify an assignable cause. Revisit outlier observations as more data becomes available. How to handle the data analysis? 6

Outlier Labeling Prior to doing any statistical analysis, data should be reviewed and checked for assumptions. Graphical methods can be used to visually accept assumptions such as normality and the lack of outliers. Some methods include box plots, trimmed means and confidence intervals. 7

Box Plots A box plot is a graphical representation of dispersion of the data. The graphic represents the lower quartile (Q ), upper quartile (Q 3 ), and median. The box includes the range of scores falling into the middle 50% of the distribution. 8

Example of Box Plots Boxplot of um/ml 90000 80000 70000 60000 um/ml 50000 40000 30000 20000 0000 0 9

Decomposition of a Box Plot Upper Fence Q 3 k(q 3 -Q ) Usually k=.5 Median Q 3 -Q Q Lower Fence k(q3-q) Outside Observations 0

Trimmed Means Assume we want to estimate the population average for purity. Clearly, the sample mean would be the logical choice. The mean of the 0 lots is 96.85. Though we know from the previous box plot, that the 90. purity value is flagged as far out when compared to the other observations. Since the smallest value might unduly influence the mean, we might want to drop it from the analysis. To preserve symmetry, we would also drop the largest value. The trimmed mean of the remaining 8 observations would be 97.36.

Using Ordered Observations Instead of removing a single observation, we might be interested in having a 00α% trimmed mean. Assume we want a 5% trimmed mean: n r Calculate r as α(n+), X i rounded to the i = r + T = closest integer. n 2 r 2

The Steps Using the Purity Data r=0.5(0+)=.65 which rounds to 2. If we order the observations 90. 95.4 96. 96.3 97.6 97.9 98. 98.3 99.2 99.5 Calculating T by eliminating the smallest 2 and largest 2 observations yields a trimmed mean of 97.38 Mean of all 0 observations = 96.85 Mean of trimmed (n=8) = 97.36 Mean of trimmed (n=6) = 97.38 3

Confidence Intervals The standard method for labeling potential outliers is the ±3s or ±2s limits on the mean. The standard formula would be: X t n * ± ( α 2; ) s n Potential outliers influence the estimate of the standard deviation (s) and the sample mean Winsorized variances in conjunction with the trimmed mean can be used for robust methods. 4

Outlier Identification Most outlier tests rely on the normal distribution. If the data is heavy tailed there is a likelihood for false positives. Later we will discuss which tests to use in a comparative workshop. 5

Extreme Studentized Deviate Test The extreme studentized deviate (ESD) test is quite good at identifying a single outlier. We declare x j to be an outlier when T s { x x / } tabled value = max s > i i Where x j is the observation leading to the max T s 6

Critical Values for ESD N α=0.05 α=0.0 0 2.29 2.48 2.35 2.56 2 2.4 2.64 3 2.46 2.70 4 2.5 2.76 5 2.55 2.8 7

Masking The extreme studentized deviate method is susceptible to masking. Masking occurs when discordant observations cancel the effect of more extreme observations Given the following observations: 5.4;5.5; 5.6; 5.9; 6.; 6.3; 6.6; 7.6; 5.4; 5.8 T s is.92 (x j =5.8) Critical value is 2.29 8

Generalized ESD A method for dealing with masking is the idea of testing a prespecified number of outliers called the generalized ESD. This method is quite robust when n is large (>25) and masking effects are possible. Compute T s for all observations. Remove the observation with the largest value of T s and recompute with the n- observations. Continue this process until all possible T s are calculated. 9

Critical Values N α=0.05 α=0.0 # Suspect Outliers 2 3 2 3 0 2.29 2.22 2.3 2.48 2.39 2.27 2.35 2.29 2.22 2.56 2.48 2.39 2 2.4 2.35 2.29 2.64 2.56 2.48 3 2.46 2.4 2.35 2.70 2.64 2.56 4 2.5 2.46 2.4 2.76 2.70 2.64 5 2.55 2.5 2.46 2.8 2.76 2.70 20

Assuming 2 Outliers Observation x i T s x i T s 5.4 0.647 5.4 0.555 2 5.5 0.623 5.5 0.523 3 5.6 0.598 5.6 0.49 4 5.9 0.524 5.9 0.397 5 6. 0.474 6. 0.333 6 6.3 0.425 6.3 0.270 7 6.6 0.35 6.6 0.76 8 7.6 0.04 7.6 0.40 9 5.4.824 5.4 2.605 0 5.8.922 Mean 8.02 7.6 Std Deviation 4.05 3.7 Maximum Ts.92 2.60 Critical Value 2.29 2.22 2

Dixon Type Tests Methods based on ordered statistics. Flexibility of testing a specific observation as an outlier. Excellent for small sample size. No need to assume a distribution such as the normal distribution. 22

Dixon Tests Testing the largest observation as an outlier r 0 x = n n x n x x r 0 = x 2 n Testing the smallest x observation x as an outlier. x 23

Critical Values for Dixon Tests (α=0.05) N r 0 3 0.94 4 0.765 5 0.642 6 0.560 7 0.507 8 0.468 9 0.437 0 0.42 24

Example 4.5 4.9 5.5 6. 6.7 7.2 0.2 4.3 r 0 =0.48 25

Outliers in Regression Analysis Two different types of observations have an impact on regression analysis Influential observations Outliers Residual analysis is a methodology used to identify outliers. 26

Regression Analysis Regression analysis or least squares estimation is a statistical technique to establish a linear relationship between two variables. These techniques are highly sensitive to outliers and influential observations. Two parameters are estimated; intercept and slope. 27

Simple Regression Model Y=β 0 +β X Y β Units Unit β o X 28

Influential and Outlier Observations Outliers are usually a single observation that usually do not influence the overall regression model. Influential observations are usually a single observation at the beginning or end of the x range that highly influences the model. 29

Influential Observations 30

Effect of Outliers on Slope, Intercept and R 2 An outlier impacts the slope intercept and R 2 in different ways. The slope can be pulled up or down based on the direction of the outlier. The intercept is more robust to outliers, but can be impacted by influential observations. 3

Comparing the Outlier Detection Methods So which method is best? Outlier methods should be select PRIOR to data collection. Should be specified in a procedure. 32

Comparison Grid Method Sample Size Number of Outliers Ease of Use Box plot Any size Multiple Most statistical software ESD Large Multiple Moderately easy Dixon Small One or Two Easily calculated 33

Using Outlier Detection in OOS The cgmp regulations require that statistically value quality control criteria include appropriate acceptance and/or rejection levels ( 2.65(d)). Outlier tests have no applicability in cases where the variability in the product is what is being assessed, such as content uniformity, dissolution, or release rate determinations. 34

Handling Data in Retest Procedures Three possible ways to handle data: Remove the data; using only the retest procedure. Keep the data; retest and combine data. Retest and compare both sets of data to determine if you can pool the data. 35

Interactive Workshop Day Potency Day Potency Day Potency 9 2 4 3 32 06 2 7 3 04 07 2 38 3 3 20 2 28 3 45 04 2 3 3 2 3 2 3 3 46 00 2 05 3 23 32 2 34 3 36 5 2 32 3 35 06 2 6 3 9 27 2 44 3 23 74 2 3 3 44 36

Boxplot Boxplot of Potency 80 70 60 50 Potency 40 30 20 0 00 37

Extreme Studentized Deviate Looking at all 36 data points, critical value=2.99 Day Potency T i Day Potency T i Day Potency T i 2 3 3 2 2 3 2 3 27 28 23 23 3 3 3 3 32 32 32 0.00 0.073 0.240 0.240 0.26 0.26 0.26 0.26 0.324 0.324 0.324 2 3 3 3 2 2 2 2 2 20 34 9 9 35 36 7 6 38 3 4 0.428 0.449 0.49 0.49 0.52 0.574 0.66 0.679 0.700 0.867 0.888 3 3 3 2 3 44 45 46 07 06 06 05 04 04 5 00.076.38.20.243.306.306.368.43.43.54.68 3 2 0.366 2 44.076 74 2.956 38

Extreme Studentized Deviate Looking at Day, 2 data points, critical value=2.4 Day Potency T i 20 27 9 3 32 07 06 06 04 00 5 0.4 0.79 0.86 0.36 0.406 0.733 0.779 0.779 0.870.052.272 74 2.32 39

Final Thoughts Statistical methods for identifying potential outliers in a data set are powerful methods to assist in any data analysis project. Pre-specify the method prior to data collection. Pre-specify the handling of outliers and retest procedures. 40