A MULTIVARIATE OUTLIER DETECTION METHOD



A MULTIVARIATE OUTLIER DETECTION METHOD

P. Filzmoser
Department of Statistics and Probability Theory, Vienna, AUSTRIA
e-mail: P.Filzmoser@tuwien.ac.at

Abstract

A method for the detection of multivariate outliers is proposed which accounts for the data structure and the sample size. The cut-off value for identifying outliers is defined by a measure of the deviation of the empirical distribution function of the robust Mahalanobis distance from the theoretical distribution function. The method is easy to implement and fast to compute.

1 Introduction

Outlier detection is among the most important tasks in data analysis. Outliers describe abnormal data behavior, i.e. data deviating from the natural data variability. Often outliers are of primary interest; in geochemical exploration, for example, they can indicate mineral deposits. The cut-off value or threshold which numerically divides anomalous from non-anomalous data is often the basis for important decisions.

Many methods have been proposed for univariate outlier detection. They are based on (robust) estimates of location and scatter, or on quantiles of the data. A major disadvantage is that these rules are independent of the sample size. Moreover, by definition of most rules (e.g. mean ± 2·scatter), outliers are identified even in clean data, or at least no distinction is made between outliers and the extremes of a distribution.

The basis for multivariate outlier detection is the Mahalanobis distance. The standard method for multivariate outlier detection is robust estimation of the parameters in the Mahalanobis distance and comparison with a critical value of the χ² distribution (Rousseeuw and Van Zomeren, 1990). However, values larger than this critical value are not necessarily outliers; they could still belong to the data distribution.

In order to distinguish between the extremes of a distribution and outliers, Garrett (1989) introduced the χ² plot, which draws the empirical distribution function of the robust Mahalanobis distances against the χ² distribution. A break in the tails of the distributions is an indication of outliers, and values beyond this break are iteratively deleted. The approach of Garrett (1989) is not an automatic procedure and requires considerable interaction of the analyst with the data. We propose a method which computes the outlier threshold adaptively from the data. By investigating the tails of the difference between the empirical and a hypothetical distribution function, we find an adaptive threshold value which increases with the sample size if there are no outliers, and which is bounded in the presence of outliers.
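The sample-size criticism of fixed univariate rules can be seen in a short simulation (a minimal sketch, not from the paper; the rule, seed, and sample sizes are illustrative): a rule like mean ± 2·scatter flags roughly the same fraction of points at any sample size, even when the data contain no outliers at all.

```python
import numpy as np

def flag_mean_sd(x, k=2.0):
    """Classical univariate rule: flag points outside mean +/- k * standard deviation."""
    return np.abs(x - x.mean()) > k * x.std()

rng = np.random.default_rng(0)
# Clean standard normal samples of very different sizes
fractions = {n: flag_mean_sd(rng.standard_normal(n)).mean() for n in (100, 10_000)}
print(fractions)  # about 5% flagged at every n, although the data are clean
```

For a normal distribution, P(|Z| > 2) ≈ 0.046, so the rule declares a constant share of perfectly regular extremes to be "outliers" regardless of n, which is exactly the behavior criticized above.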

2 Methods for Multivariate Outlier Detection

The shape and size of multivariate data are quantified by the covariance matrix. A well-known distance measure which takes the covariance matrix into account is the Mahalanobis distance. For a p-dimensional multivariate sample x_i (i = 1, ..., n) the Mahalanobis distance is defined as

MD_i = [(x_i − t)^T C^{−1} (x_i − t)]^{1/2}   for i = 1, ..., n    (1)

where t is the estimated multivariate location and C the estimated covariance matrix. Usually, t is the multivariate arithmetic mean and C is the sample covariance matrix. For multivariate normally distributed data, the squared distances MD_i² are approximately chi-square distributed with p degrees of freedom (χ²_p).

Multivariate outliers can now simply be defined as observations having a large (squared) Mahalanobis distance. For this purpose, a quantile of the chi-square distribution (e.g., the 97.5% quantile) could be considered. However, this approach has several shortcomings. The location and covariance in the Mahalanobis distance need to be estimated by a robust procedure in order to provide reliable measures for the recognition of outliers: single extreme observations, or groups of observations departing from the main data structure, can have a severe influence on this distance measure when location and covariance are estimated in a non-robust manner. Many robust estimators for location and covariance have been introduced in the literature. The minimum covariance determinant (MCD) estimator is probably the most frequently used in practice, partly because a computationally fast algorithm is available (Rousseeuw and Van Driessen, 1999). Using robust estimators of location and scatter in formula (1) leads to so-called robust distances (RD). Rousseeuw and Van Zomeren (1990) used these RDs for multivariate outlier detection: if the squared RD for an observation is larger than, say, χ²_{p;0.975}, it can be declared a candidate outlier.
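The standard approach just described can be sketched in a few lines of Python, assuming scipy and scikit-learn (scikit-learn's MinCovDet implements the fast MCD algorithm; the simulated data and seeds are illustrative, not from the paper):

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(42)
X = rng.standard_normal((100, 2))        # bulk of the data: bivariate standard normal
X[:10] += 4.0                            # shift 10 points away from the main structure

mcd = MinCovDet(random_state=0).fit(X)   # robust location t and covariance C via MCD
rd2 = mcd.mahalanobis(X)                 # squared robust distances RD_i^2
cutoff = chi2.ppf(0.975, df=X.shape[1])  # fixed threshold chi2_{p;0.975}
outliers = rd2 > cutoff
print(outliers.sum())                    # candidate outliers by the fixed-quantile rule
```

Note that the same fixed cutoff is applied whatever the sample size, which is the shortcoming the adaptive method below addresses.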
This approach, however, has shortcomings: it does not account for the sample size n of the data and, independently of the data structure, observations could be flagged as outliers even if they belong to the data distribution. A better procedure than using a fixed threshold is to adjust the threshold to the data set at hand. Garrett (1989) used the chi-square plot for this purpose: the squared Mahalanobis distances (computed on the basis of robust estimates of location and scatter) are plotted against the quantiles of χ²_p, and the most extreme points are deleted until the remaining points follow a straight line. The deleted points are the identified outliers. This method, however, is not automatic; it needs user interaction and experience on the part of the analyst. Moreover, especially for large data sets, it can be time consuming, and it is to some extent subjective. In the next section a procedure is introduced that does not require analyst intervention and is reproducible and therefore objective.

3 An Adaptive Method

The chi-square plot is useful for visualizing the deviation of the data distribution from multivariate normality in the tails. This principle is used in the following. Let G_n(u) denote the empirical distribution function of the squared robust distances RD_i², and let G(u) be the distribution function of χ²_p. For multivariate normally distributed samples, G_n converges to G. Therefore the tails of G_n and G can be compared to detect outliers. The tails are defined by δ = χ²_{p;1−α} for a certain small α (e.g., α = 0.025), and

p_n(δ) = sup_{u ≥ δ} (G(u) − G_n(u))^+

is considered, where + indicates the positive differences. In this way, p_n(δ) measures the departure of the empirical from the theoretical distribution only in the tails, defined by the value of δ, and it can be considered as a measure of outliers in the sample. Gervini (2003) used this idea as a reweighting step for the robust estimation of multivariate location and scatter; the efficiency of the estimator could be improved considerably.

p_n(δ) will not be used directly as a measure of outliers. The threshold should be infinity in the case of multivariate normally distributed data, i.e. extreme values or values from the same distribution should not be declared outliers. Therefore a critical value p_crit is introduced, which helps to distinguish between outliers and extremes. The measure of outliers in the sample is then defined as

α_n(δ) = 0 if p_n(δ) ≤ p_crit(δ, n, p),   α_n(δ) = p_n(δ) if p_n(δ) > p_crit(δ, n, p).

The threshold value, which will be called the adjusted quantile, is then determined as

c_n(δ) = G_n^{−1}(1 − α_n(δ)).

The critical value for distinguishing between outliers and extremes can be derived by simulation, and the result is approximately

p_crit(δ, n, p) = (0.24 − 0.003 p)/√n   for δ = χ²_{p;0.975} and p ≤ 10,
p_crit(δ, n, p) = (0.252 − 0.0018 p)/√n   for δ = χ²_{p;0.975} and p > 10

(see Filzmoser, Reimann, and Garrett, 2003).

4 Example

We simulate a data set in two dimensions in order to simplify the graphical visualization. 85 data points follow a bivariate standard normal distribution.
Multivariate outliers are introduced by 15 points coming from a bivariate normal distribution with mean (2, 2)^T and covariance matrix diag(1/10, 1/10). The data are presented in Figure 1. The MCD estimator is applied and the robust distances are computed. Figure 2 shows in more detail how the adaptive outlier detection method works: it plots the empirical distribution function of the ordered squared robust distances (points) against the theoretical distribution function of χ²_2 (solid line). The fixed threshold χ²_{2;0.975} identifies only four outliers, which are presented in the left part of Figure 3 as dark points. The adjusted quantile gives a more realistic rule, and the identified outliers are shown in the right part of Figure 3.
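The adaptive threshold of Section 3 applied to data of this kind can be sketched as follows. This is an illustrative implementation of the formulas above, assuming Python with scipy and scikit-learn; the left limit of the empirical cdf is used to evaluate the supremum, and the simulated points and seeds are not the paper's exact data, so the counts need not match the figures.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def adjusted_quantile(rd2, p, alpha=0.025):
    """Adaptive outlier threshold c_n(delta) from the squared robust distances rd2."""
    n = rd2.size
    delta = chi2.ppf(1.0 - alpha, df=p)          # delta = chi2_{p;1-alpha}
    srt = np.sort(rd2)
    Gn_left = np.arange(n) / n                   # left limit of the empirical cdf G_n
    diff = chi2.cdf(srt, df=p) - Gn_left         # G(u) - G_n(u-) at the jump points
    tail = srt >= delta
    pn = max(float(diff[tail].max()) if tail.any() else 0.0, 0.0)
    # simulated critical value p_crit(delta, n, p), as quoted in the paper
    if p <= 10:
        pcrit = (0.24 - 0.003 * p) / np.sqrt(n)
    else:
        pcrit = (0.252 - 0.0018 * p) / np.sqrt(n)
    alpha_n = pn if pn > pcrit else 0.0
    if alpha_n == 0.0:
        return np.inf                            # only ordinary extremes: no outliers
    return float(np.quantile(rd2, 1.0 - alpha_n))  # c_n = G_n^{-1}(1 - alpha_n)

# Data in the spirit of Section 4: 85 inliers plus 15 shifted points near (2, 2)
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((85, 2)),
               rng.normal(loc=2.0, scale=np.sqrt(0.1), size=(15, 2))])
mcd = MinCovDet(random_state=0).fit(X)
rd2 = mcd.mahalanobis(X)
c = adjusted_quantile(rd2, p=2)
print((rd2 > c).sum())                           # points flagged by the adaptive rule
```

With a clear outlier group the tail excess p_n(δ) exceeds p_crit, so the adjusted quantile is finite and flags the shifted cluster; on clean normal data p_n(δ) typically stays below p_crit and the threshold becomes infinite, so nothing is flagged.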

Figure 1: Simulated bivariate data with outliers shown as dark points.

Figure 2: Empirical distribution function of the ordered squared robust distances (points) and theoretical distribution function of χ²_2 (solid line); the 97.5% quantile and the adjusted quantile are marked.

Figure 3: Resulting outliers (dark points) according to χ²_{2;0.975} (left) and the adaptive quantile (right).

5 Summary

An automated method to identify outliers in multivariate space was developed. The method compares the empirical distribution function of the squared robust distances with the distribution function of the chi-square distribution, and it accounts not only for the dimension of the data but also for the sample size.

References

[1] Filzmoser P., Reimann C., Garrett R.G. (2003). Multivariate outlier detection in exploration geochemistry. Technical Report TS 03-5, Department of Statistics, Vienna University of Technology, Austria, December 2003.
[2] Garrett R.G. (1989). The chi-square plot: A tool for multivariate outlier recognition. Journal of Geochemical Exploration, Vol. 32, pp. 319-341.
[3] Gervini D. (2003). A robust and efficient adaptive reweighted estimator of multivariate location and scatter. Journal of Multivariate Analysis, Vol. 84, pp. 116-144.
[4] Rousseeuw P.J., Van Driessen K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, Vol. 41, pp. 212-223.
[5] Rousseeuw P.J., Van Zomeren B.C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, Vol. 85(411), pp. 633-651.