A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets

Transcription

A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets

by Songwon Seo
BS, Kyunghee University

Submitted to the Graduate Faculty of Graduate School of Public Health in partial fulfillment of the requirements for the degree of Master of Science

University of Pittsburgh
2006

UNIVERSITY OF PITTSBURGH
Graduate School of Public Health

This thesis was presented by Songwon Seo.

It was defended on April 6, 2006 and approved by:

Laura Cassidy, Ph.D., Assistant Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

Ravi K. Sharma, Ph.D., Assistant Professor, Department of Behavioral and Community Health Sciences, Graduate School of Public Health, University of Pittsburgh

Thesis Director: Gary M. Marsh, Ph.D., Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

Gary M. Marsh, Ph.D.

A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets

Songwon Seo, M.S.
University of Pittsburgh, 2006

Most real-world data sets contain outliers that have unusually large or small values when compared with others in the data set. Outliers may cause a negative effect on data analyses, such as ANOVA and regression, based on distribution assumptions, or may provide useful information about data when we look into an unusual response to a given study. Thus, outlier detection is an important part of data analysis in the above two cases. Several outlier labeling methods have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. Although these methods are quite powerful with large normal data, it may be problematic to apply them to nonnormal data or small sample sizes without knowledge of their characteristics in these circumstances. This is because each labeling method has different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data. Many kinds of data regarding public health are often skewed, usually to the right, and lognormal distributions can often be applied to such skewed data, for instance, surgical procedure times, blood pressure, and assessment of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets. These results may help establish guidelines for the choice of outlier detection methods in skewed data, which are often seen in the public health field.

TABLE OF CONTENTS
1. INTRODUCTION
1.1 BACKGROUND
1.2 OUTLIER DETECTION METHOD
2. STATEMENT OF PROBLEM
3. OUTLIER LABELING METHOD
3.1 STANDARD DEVIATION (SD) METHOD
3.2 Z-SCORE
3.3 THE MODIFIED Z-SCORE
3.4 TUKEY'S METHOD (BOXPLOT)
3.5 ADJUSTED BOXPLOT
3.6 MADe METHOD
3.7 MEDIAN RULE
4. SIMULATION STUDY AND RESULTS FOR THE FIVE SELECTED LABELING METHODS
5. APPLICATION
6. RECOMMENDATIONS
7. DISCUSSION AND CONCLUSIONS
APPENDIX A: THE EXPECTATION, STANDARD DEVIATION AND SKEWNESS OF A LOGNORMAL DISTRIBUTION
APPENDIX B: MAXIMUM Z SCORE
APPENDIX C: CLASSICAL AND MEDCOUPLE (MC) SKEWNESS
APPENDIX D: BREAKDOWN POINT
APPENDIX E: PROGRAM CODE FOR OUTLIER LABELING METHODS
BIBLIOGRAPHY

LIST OF TABLES
Table 1: Basic Statistic of a Simple Data Set
Table 2: Basic Statistic After Changing 7 into 77 in the Simple Data Set
Table 3: Computation and Masking Problem of the Z-Score
Table 4: Computation of Modified Z-Score and its Comparison with the Z-Score
Table 5: The Average Percentage of Left Outliers, Right Outliers and the Average Total Percent of Outliers for the Lognormal Distributions with the Same Mean and Different Variances (mean=0, variance=0.2², 0.4², 0.6², 0.8², 1.0²) and the Standard Normal Distribution with Different Sample Sizes
Table 6: Interval, Left, Right, and Total Number of Outliers According to the Five Outlier Methods

LIST OF FIGURES
Figure 1: Probability density function for a normal distribution according to the standard deviation
Figure 2: Theoretical Change of Outliers Percentage According to the Skewness of the Lognormal Distributions in the SD Method and Tukey's Method
Figure 3: Density Plot and Dotplot of the Lognormal Distribution (sample size=50) with Mean=1 and SD=1, and its Logarithm, Y=log(x)
Figure 4: Boxplot for the Example Data Set
Figure 5: Boxplot and Dotplot. (Note: No outlier shown in the boxplot)
Figure 6: Change of the Intervals of Two Different Boxplot Methods
Figure 7: Standard Normal Distribution and Lognormal Distributions
Figure 8: Change in the Outlier Percentages According to the Skewness of the Data
Figure 9: Change in the Total Percentages of Outliers According to the Sample Size
Figure 10: Histogram and Basic Statistics of Case 1-Case
Figure 11: Flowchart of Outlier Labeling Methods
Figure 12: Change of the Two Types of Skewness Coefficients According to the Sample Size and Data Distribution. (Note: These results came from the previous simulation. All the values are in Table 5.)

1. INTRODUCTION

This chapter consists of two sections: the Background and Outlier Detection Method. In the Background, basic ideas of an outlier are discussed, such as definitions, features, and reasons to detect outliers. In the Outlier Detection Method section, characteristics of the two kinds of outlier detection methods are described briefly: formal and informal tests.

1.1 BACKGROUND

Observed variables often contain outliers that have unusually large or small values when compared with others in a data set. Some data sets may come from homogeneous groups; others from heterogeneous groups that have different characteristics regarding a specific variable, such as height data not stratified by gender. Outliers can be caused by incorrect measurements, including data entry errors, or by coming from a different population than the rest of the data. If the measurement is correct, it represents a rare event. Two aspects of an outlier can be considered.

The first aspect to note is that outliers cause a negative effect on data analysis. Osborne and Overbay (2004) briefly categorized the deleterious effects of outliers on statistical analyses:
1) Outliers generally serve to increase error variance and reduce the power of statistical tests.
2) If non-randomly distributed, they can decrease normality (and in multivariate analyses, violate assumptions of sphericity and multivariate normality), altering the odds of making both Type I and Type II errors.
3) They can seriously bias or influence estimates that may be of substantive interest.

The following example shows how one outlier can highly distort the mean, variance, and 95% confidence interval for the mean. Let's suppose there is a simple data set composed of data points 1, 2, 3, 4, 5, 6, 7 and its basic statistics are as shown in Table 1. Now, let's replace data point 7 with 77. As shown in Table 2, the mean and variance of the data are much larger than those of the original data set due to one unusual data value, 77. The 95% confidence interval for the mean is also much broader because of the large variance. This may cause potential problems when a data analysis that is sensitive to a mean or variance is conducted.

Table 1: Basic Statistic of a Simple Data Set
Mean | Median | Variance | 95% Confidence Interval for the mean
4 | 4 | 4.67 | [2.00 to 6.00]

Table 2: Basic Statistic After Changing 7 into 77 in the Simple Data Set
Mean | Median | Variance | 95% Confidence Interval for the mean
14 | 4 | 774.67 | [-11.74 to 39.74]

The second aspect of outliers is that they can provide useful information about data when we look into an unusual response to a given study. They could be the extreme values sitting apart from the majority of the data regardless of distribution assumptions. The following two cases are good examples of outlier analysis in terms of the second aspect of an outlier:
1) to identify medical practitioners who under- or over-utilize specific procedures or medical equipment, such as an x-ray instrument;
2) to identify Primary Care Physicians (PCPs) with inordinately high Member Dissatisfaction Rates (MDRs) (MDRs = the number of member complaints / PCP practice size) compared to other PCPs [3].

In summary, there are two reasons for detecting outliers. The first reason is to find outliers which influence assumptions of a statistical test, for example, outliers violating the normal distribution assumption in an ANOVA test, and deal with them properly in order to improve statistical analysis. This could be considered as a preliminary step for data analysis. The second reason is to use the outliers themselves for the purpose of obtaining certain critical information about the data, as was shown in the above examples.
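The small example behind Tables 1 and 2 can be checked with a few lines of Python. This is only a sketch: the function name `summarize` is illustrative, and the t critical value for 6 degrees of freedom is hard-coded as an assumption so that the snippet needs nothing beyond the standard library.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 6 degrees of freedom
# (n = 7 observations). Hard-coded as an assumption to stay stdlib-only.
T_CRIT_DF6 = 2.4469


def summarize(data, t_crit=T_CRIT_DF6):
    """Return mean, median, sample variance and a 95% CI for the mean."""
    n = len(data)
    mean = statistics.mean(data)
    variance = statistics.variance(data)  # sample variance, n - 1 denominator
    half_width = t_crit * math.sqrt(variance / n)
    return {
        "mean": mean,
        "median": statistics.median(data),
        "variance": variance,
        "ci95": (mean - half_width, mean + half_width),
    }


original = [1, 2, 3, 4, 5, 6, 7]
contaminated = [1, 2, 3, 4, 5, 6, 77]  # data point 7 replaced with 77

before = summarize(original)
after = summarize(contaminated)
print(before)
print(after)
```

The median is unchanged by the replacement, while the mean, the variance, and the width of the confidence interval all grow sharply.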

1.2 OUTLIER DETECTION METHOD

There are two kinds of outlier detection methods: formal tests and informal tests. Formal and informal tests are usually called tests of discordancy and outlier labeling methods, respectively.

Most formal tests need test statistics for hypothesis testing. They are usually based on assuming some well-behaving distribution, and test if the target extreme value is an outlier of the distribution, i.e., whether or not it deviates from the assumed distribution. Some tests are for a single outlier and others for multiple outliers. Selection of these tests mainly depends on the number and type of target outliers, and the type of data distribution [1]. Many tests according to the choice of distributions are discussed in Barnett and Lewis (1994) and Iglewicz and Hoaglin (1993). Iglewicz and Hoaglin (1993) reviewed and compared, through simulations, five selected formal tests which are applicable to the normal distribution: the Generalized ESD, Kurtosis statistics, Shapiro-Wilk, the Boxplot rule, and the Dixon test. Even though formal tests are quite powerful under well-behaving statistical assumptions such as a distribution assumption, most distributions of real-world data may be unknown or may not follow specific distributions such as the normal, gamma, or exponential. Another limitation is that they are susceptible to masking or swamping problems. Acuna and Rodriguez (2004) define these problems as follows:

Masking effect: It is said that one outlier masks a second outlier if the second outlier can be considered as an outlier only by itself, but not in the presence of the first outlier. Thus, after the deletion of the first outlier the second instance emerges as an outlier.

Swamping effect: It is said that one outlier swamps a second observation if the latter can be considered as an outlier only under the presence of the first one. In other words, after the deletion of the first outlier the second observation becomes a non-outlying observation.

Many studies regarding these problems have been conducted by Barnett and Lewis (1994), Iglewicz and Hoaglin (1993), Davies and Gather (1993), and Bendre and Kale (1987).

On the other hand, most outlier labeling methods, the informal tests, generate an interval or criterion for outlier detection instead of hypothesis testing, and any observation beyond the interval or criterion is considered an outlier. Various location and scale parameters are mostly employed in each labeling method to define a reasonable interval or criterion for outlier detection. There are two reasons for using an outlier labeling method. One is to find possible outliers as a screening device before conducting a formal test. The other is to find the extreme values away from the majority of the data regardless of the distribution.

While the formal tests usually require test statistics based on the distribution assumptions and a hypothesis to determine if the target extreme value is a true outlier of the distribution, most outlier labeling methods present the interval using the location and scale parameters of the data. Although a labeling method is usually simple to use, some observations outside the interval may turn out to be falsely identified outliers after a formal test when outliers are defined only as observations that deviate from the assumed distribution. However, if the purpose of the outlier detection is not a preliminary step to find the extreme values violating the distribution assumptions of the main statistical analyses such as the t-test, ANOVA, and regression, but mainly to find the extreme values away from the majority of the data regardless of the distribution, the outlier labeling methods may be applicable. In addition, for a large data set that is statistically problematic, e.g., when it is difficult to identify the distribution of the data or transform it into a proper distribution such as the normal distribution, labeling methods can be used to detect outliers.

This paper focuses on outlier labeling methods. Chapter 2 presents the possible problems when labeling methods are applied to skewed data. In Chapter 3, seven outlier labeling methods are outlined. In Chapter 4, the average percentages of outliers in the standard normal and lognormal distributions with the same mean and different variances are computed to compare the outlier percentages of the selected five outlier labeling methods according to the degree of skewness and different sample sizes. In Chapter 5, the five selected methods are applied to real data sets.

2. STATEMENT OF PROBLEM

Outlier labeling methods such as the Standard Deviation (SD) method and the boxplot are commonly used and are easy to use. These methods are quite reasonable when the data distribution is symmetric and mound-shaped, such as the normal distribution. Figure 1 shows that about 68%, 95%, and 99.7% of the data from a normal distribution are within 1, 2, and 3 standard deviations of the mean, respectively. If data follow a normal distribution, this helps to estimate the likelihood of having extreme values in the data [3], so that an observation two or three standard deviations away from the mean may be considered an outlier in the data.

Figure 1: Probability density function for a normal distribution according to the standard deviation.

The boxplot, which was developed by Tukey (1977), is another very helpful method since it makes no distributional assumptions, nor does it depend on a mean or standard deviation [19]. The lower quartile (q1) is the 25th percentile, and the upper quartile (q3) is the 75th percentile of the data. The inter-quartile range (IQR) is defined as the interval between q1 and q3.
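The 68%, 95%, and 99.7% figures, and the quartiles of a normal distribution, can be reproduced with the standard library. The following is a minimal sketch; the helper name `coverage_within` is illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage_within(k):
    """Probability that a normal observation lies within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * coverage_within(k):.2f}%")

# Quartiles and inter-quartile range of the standard normal distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1
print(f"q1={q1:.4f}, q3={q3:.4f}, IQR={iqr:.4f}")
```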

Tukey (1977) defined q1-(1.5*iqr) and q3+(1.5*iqr) as inner fences, q1-(3*iqr) and q3+(3*iqr) as outer fences, the observations between an inner fence and its nearby outer fence as "outside", and anything beyond the outer fences as "far out" [31]. High (2000) renamed the "outside" observations potential outliers and the "far out" observations problematic outliers [19]. The "outside" and "far out" observations can also be called possible outliers and probable outliers, respectively. This method is quite effective, especially when working with large continuous data sets that are not highly skewed [19].

Although Tukey's method is quite effective when working with large data sets that are fairly normally distributed, many distributions of real-world data do not follow a normal distribution. They are often highly skewed, usually to the right, and in such cases the distributions are frequently closer to a lognormal distribution than a normal one [1]. The lognormal distribution can often be applied to such data in a variety of forms, for instance, personal income, blood pressure, and assessment of toxic compounds in environmental analysis.

In order to illustrate how the theoretical percentage of outliers changes according to the skewness of the data in the SD method (Mean ± 2 SD, Mean ± 3 SD) and Tukey's method, lognormal distributions with the same mean (0) but different standard deviations (0.2, 0.4, 0.6, 0.8, 1.0, 1.2) are used for the data sets with different degrees of skewness, and the standard normal distribution is used for the data set whose skewness is zero. The computation of the mean, standard deviation, and skewness in a lognormal distribution is in Appendix A. According to Figure 2, the two methods show different patterns, e.g., the outlier percentage of Tukey's method increases with skewness, unlike that of the SD method. This shows that the results of outlier detection may change depending on the outlier detection method or the distribution of the data.
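The theoretical percentages described above can be sketched as follows. The code assumes that the stated mean (0) and standard deviations are the parameters of the underlying normal distribution on the log scale, and it uses the closed-form lognormal mean and standard deviation; the function names are illustrative and this is not the thesis's own program.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal, used to evaluate the lognormal CDF


def lognormal_cdf(x, sigma, mu=0.0):
    """CDF of a lognormal variable whose logarithm is N(mu, sigma^2)."""
    if x <= 0:
        return 0.0
    return Z.cdf((math.log(x) - mu) / sigma)


def pct_outside(lower, upper, sigma):
    """Theoretical percentage of a lognormal(0, sigma) outside [lower, upper]."""
    return 100 * (lognormal_cdf(lower, sigma) + 1 - lognormal_cdf(upper, sigma))


def sd_method_pct(sigma, k):
    """Percentage outside Mean +/- k SD, using the lognormal mean and SD."""
    mean = math.exp(sigma ** 2 / 2)
    sd = mean * math.sqrt(math.exp(sigma ** 2) - 1)
    return pct_outside(mean - k * sd, mean + k * sd, sigma)


def tukey_pct(sigma, k):
    """Percentage outside [q1 - k*IQR, q3 + k*IQR] for a lognormal(0, sigma)."""
    q1 = math.exp(sigma * Z.inv_cdf(0.25))
    q3 = math.exp(sigma * Z.inv_cdf(0.75))
    iqr = q3 - q1
    return pct_outside(q1 - k * iqr, q3 + k * iqr, sigma)


for sigma in (0.2, 0.4, 0.6, 0.8, 1.0, 1.2):
    print(
        f"sigma={sigma}: 2 SD={sd_method_pct(sigma, 2):.2f}%, "
        f"3 SD={sd_method_pct(sigma, 3):.2f}%, "
        f"Tukey 1.5 IQR={tukey_pct(sigma, 1.5):.2f}%, "
        f"Tukey 3 IQR={tukey_pct(sigma, 3):.2f}%"
    )
```

Under these assumptions the Tukey percentages rise steadily with the log-scale standard deviation, while the 2 SD percentage drifts slightly downward.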

Figure 2: Theoretical Change of Outliers Percentage According to the Skewness of the Lognormal Distributions in the SD Method and Tukey's Method (outlier percentage plotted against skewness for the 2 SD Method (Mean ± 2 SD), the 3 SD Method (Mean ± 3 SD), Tukey's Method (1.5 IQR), and Tukey's Method (3 IQR))

When data are highly skewed or in other respects depart from a normal distribution, transformation to normality is a common step in order to identify outliers using a method which is quite effective in a normal distribution. Such a transformation could be useful when the identification of outliers is conducted as a preliminary step for data analysis, and it also helps to make possible the selection of appropriate statistical procedures for estimating and testing [1]. However, if an outlier itself is a primary concern in a given study, as was shown in a previous example in the identification of medical practitioners who under- or over-utilize such medical equipment as x-ray instruments, a transformation of the data could affect our ability to identify outliers.

For example, 50 random samples (x) are generated through the statistical software R in order to show the effect of the transformation. The random variable X has a lognormal distribution (Mean=1, SD=1), and its logarithm, Y=log(x), has a normal distribution. If the observations which are beyond the mean by two standard deviations are considered outliers, the expected outliers before and after transformation are totally different. As shown in Figure 3, while three observations which have large values are considered as outliers in the original 50 random samples (x), after log transformation of these samples, two observations of small values appear to be outliers, and the former large valued observations are no longer considered to be outliers. The vertical lines in each graph represent cutoff values (Mean ± 2*SD). Lower and upper cutoff values are ( , ) and ( , .76336), respectively, in the lognormal data (x) and its logarithm (y). Although this approach is not affected by extreme values because it does not depend on the extreme observations after transformation, an artificial transformation may reshape the data so that true outliers are not detected or other observations are falsely identified as outliers [1].

Figure 3: Density Plot and Dotplot of the Lognormal Distribution (sample size=50) with Mean=1 and SD=1, and its Logarithm, Y=log(x). (Two panels: density and dotplot of x, and density and dotplot of y.)

Several methods to identify outliers have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. The objective of this paper is to review and compare several common and less common labeling methods for identifying outliers and to present information that shows how the average percentage of outliers changes in each method according to the degree of skewness and sample size of the data, in order to help establish guidelines for the choice of outlier detection methods in skewed data when an outlier itself is a primary concern in a given study.
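The effect of the log transformation on the Mean ± 2 SD rule can be explored with a small experiment. This sketch does not reproduce the thesis's R sample: the seed is arbitrary, the sample size of 50 is taken from the example, and "Mean=1, SD=1" is assumed to mean the log-scale parameters, so the specific observations flagged will differ from Figure 3.

```python
import math
import random
import statistics


def flag_two_sd(values):
    """Return the indices of observations outside mean +/- 2 SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - 2 * sd, mean + 2 * sd
    return {i for i, v in enumerate(values) if v < lower or v > upper}


rng = random.Random(2006)  # arbitrary seed, only for reproducibility
x = [rng.lognormvariate(1.0, 1.0) for _ in range(50)]  # log-scale mean 1, SD 1
y = [math.log(v) for v in x]  # log-transformed data

flagged_raw = flag_two_sd(x)
flagged_log = flag_two_sd(y)

print("flagged before transformation:", sorted(x[i] for i in flagged_raw))
print("flagged after transformation: ", sorted(x[i] for i in flagged_log))
print("flagged in both:", len(flagged_raw & flagged_log))
```

On the raw scale the lower cutoff is typically negative, so only large values can be flagged, whereas on the log scale both tails are eligible.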

3. OUTLIER LABELING METHOD

This chapter reviews seven outlier labeling methods and gives examples of simple numerical computations for each test.

3.1 STANDARD DEVIATION (SD) METHOD

The simple classical approach to screen outliers is to use the SD (Standard Deviation) method. It is defined as

2 SD Method: $\bar{x} \pm 2\,SD$
3 SD Method: $\bar{x} \pm 3\,SD$,

where $\bar{x}$ is the sample mean and SD is the sample standard deviation. The observations outside these intervals may be considered as outliers. According to the Chebyshev inequality, if a random variable X with mean $\mu$ and variance $\sigma^2$ exists, then for any $k > 0$,

$$P\left[\,|X - \mu| \ge k\sigma\,\right] \le \frac{1}{k^2}, \qquad P\left[\,|X - \mu| < k\sigma\,\right] \ge 1 - \frac{1}{k^2}, \quad k > 0.$$

The inequality $[1 - (1/k)^2]$ enables us to determine what proportion of our data will be within k standard deviations of the mean [3]. For example, at least 75%, 89%, and 94% of the data are within 2, 3, and 4 standard deviations of the mean, respectively. These results may help us determine the likelihood of having extreme values in the data [3]. Although Chebyshev's theorem is true for any data from any distribution, it is limited in that it only gives the smallest proportion of observations within k standard deviations of the mean. When the distribution of a random variable is known, a more exact proportion of observations centering around the mean can be computed. For instance, if certain data follow a normal distribution, approximately 68%, 95%, and 99.7% of the data are within 1, 2, and 3 standard deviations of the mean, respectively; thus, the observations beyond two or three SD above and below the mean of the observations may be considered as outliers in the data.

The example data set, X, for a simple example of this method is as follows:

3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15.

For the data set, $\bar{x}$ = 5.46, SD = 3.86, and the intervals of the 2 SD and 3 SD methods are (-2.25, 13.18) and (-6.11, 17.04), respectively. Thus, 14 and 15 are beyond the interval of the 2 SD method and there are no outliers in the 3 SD method.

3.2 Z-SCORE

Another method that can be used to screen data for outliers is the Z-Score, using the mean and standard deviation.

$$Z_i = \frac{x_i - \bar{x}}{sd}, \quad \text{where } X_i \sim N(\mu, \sigma^2), \text{ and } sd \text{ is the standard deviation of the data.}$$

The basic idea of this rule is that if X follows a normal distribution, $N(\mu, \sigma^2)$, then Z follows a standard normal distribution, $N(0, 1)$, and Z-scores that exceed 3 in absolute value are generally considered as outliers. This method is simple, and it is the same formula as the 3 SD method when the criterion of an outlier is an absolute value of a Z-score of at least 3. It presents a reasonable criterion for identification of the outlier when data follow the normal distribution. According to Shiffler (1988), the maximum possible Z-score depends on the sample size, and it is computed as $(n-1)/\sqrt{n}$. The proof is given in Appendix B. Since no z-score exceeds 3 in a sample size less than or equal to 10, the z-score method is not very good for outlier labeling, particularly in small data sets [1]. Another limitation of this rule is that the standard deviation can be inflated by a few or even a single observation having an extreme value. Thus it can cause a masking problem, i.e., the less extreme outliers go undetected because of the most extreme outlier(s), and vice versa.
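The SD intervals for the example data set, the Chebyshev lower bound, and Shiffler's maximum Z-score can be checked with a short sketch. The data values are taken as 3.2, ..., 4.2, ..., 15, which reproduce the reported mean of 5.46 and SD of 3.86; the function names are illustrative.

```python
import math
import statistics

# Example data set X from Section 3.1.
X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def sd_interval(data, k):
    """Interval mean +/- k * SD, using the sample standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return mean - k * sd, mean + k * sd


def outside(data, interval):
    """Observations falling outside the given (lower, upper) interval."""
    lower, upper = interval
    return [v for v in data if v < lower or v > upper]


def chebyshev_lower_bound(k):
    """Smallest proportion of data within k SDs of the mean, for any distribution."""
    return 1 - 1 / k ** 2


def max_possible_z(n):
    """Largest attainable |Z-score| in a sample of size n: (n - 1) / sqrt(n)."""
    return (n - 1) / math.sqrt(n)


two_sd = sd_interval(X, 2)
three_sd = sd_interval(X, 3)
print("2 SD interval:", two_sd, "flags", outside(X, two_sd))
print("3 SD interval:", three_sd, "flags", outside(X, three_sd))
print("Chebyshev bounds:", [round(chebyshev_lower_bound(k), 2) for k in (2, 3, 4)])
print("max |Z| for n=10 and n=11:", max_possible_z(10), max_possible_z(11))
```

The last line illustrates why a cutoff of 3 cannot be reached in samples of size 10 or smaller.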
When masking occurs, the outliers may be neighbors. Table 3 shows a computation and masking problem of the Z-Score method using the previous example data set, X.

Table 3: Computation and Masking Problem of the Z-Score
Columns: i | Case 1 ($\bar{x}$=5.46, sd=3.86): $x_i$, Z-Score | Case 2 ($\bar{x}$=4.73, sd=2.82): $x_i$, Z-Score

For case 1, with all of the example data included, it appears that the values 14 and 15 are outliers, yet no observation exceeds the absolute value of 3. For case 2, with the most extreme value, 15, excluded from the example data, 14 is considered an outlier. This is because multiple extreme values have artificially inflated the standard deviation.

3.3 THE MODIFIED Z-SCORE

Two estimators used in the Z-Score, the sample mean and sample standard deviation, can be affected by a few extreme values or even by a single extreme value. To avoid this problem, the median and the median of the absolute deviation of the median (MAD) are employed in the modified Z-Score instead of the mean and standard deviation of the sample, respectively (Iglewicz and Hoaglin, 1993).

$$MAD = \operatorname{median}\{\,|x_i - \tilde{x}|\,\}, \quad \text{where } \tilde{x} \text{ is the sample median.}$$

The modified Z-Score ($M_i$) is computed as

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{MAD}, \quad \text{where } E(MAD) = 0.675\,\sigma \text{ for large normal data.}$$

Iglewicz and Hoaglin (1993) suggested that observations are labeled outliers when $|M_i| > 3.5$, based on a simulation with pseudo-normal observations for sample sizes of 10, 20, and 40 [1]. The $M_i$ score is effective for normal data in the same way as the Z-score.

Table 4: Computation of Modified Z-Score and its Comparison with the Z-Score
Columns: i | $x_i$ | Z-Score | modified Z-Score

Table 4 shows the computation of the modified Z-Score and its comparison with the Z-Score of the previous example data set. While no observation is detected as an outlier in the Z-Score, two extreme values, 14 and 15, are detected as outliers at the same time in the modified Z-Score since this method is less susceptible to the extreme values.
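The masking behaviour of the Z-Score and the computation of the modified Z-Score can be sketched as follows for the example data set X of Section 3.1 (taken as 3.2, ..., 15). The cutoffs 3 and 3.5 are those stated in the text; the function names are illustrative.

```python
import statistics

X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def z_scores(data):
    """Classical Z-scores based on the sample mean and standard deviation."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [(v - mean) / sd for v in data]


def modified_z_scores(data):
    """Modified Z-scores based on the median and MAD."""
    med = statistics.median(data)
    mad = statistics.median([abs(v - med) for v in data])
    return [0.6745 * (v - med) / mad for v in data]


def flagged(data, scores, cutoff):
    """Observations whose score exceeds the cutoff in absolute value."""
    return [v for v, s in zip(data, scores) if abs(s) > cutoff]


# Case 1: all observations. Case 2: the most extreme value, 15, excluded.
case1 = X
case2 = [v for v in X if v != 15]

z_case1 = flagged(case1, z_scores(case1), 3)
z_case2 = flagged(case2, z_scores(case2), 3)
m_case1 = flagged(case1, modified_z_scores(case1), 3.5)

print("Z-score, case 1:", z_case1)
print("Z-score, case 2:", z_case2)
print("modified Z-score, case 1:", m_case1)
```

With all 14 observations the Z-Score flags nothing, yet removing 15 makes 14 exceed the cutoff, which is the masking effect described above; the modified Z-Score flags both values at once.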

3.4 TUKEY'S METHOD (BOXPLOT)

Tukey's (1977) method, constructing a boxplot, is a well-known simple graphical tool to display information about continuous univariate data, such as the median, lower quartile, upper quartile, lower extreme, and upper extreme of a data set. It is less sensitive to extreme values of the data than the previous methods using the sample mean and standard deviation because it uses quartiles, which are resistant to extreme values. The rules of the method are as follows:
1. The IQR (Inter Quartile Range) is the distance between the lower (Q1) and upper (Q3) quartiles.
2. Inner fences are located at a distance 1.5 IQR below Q1 and above Q3 [Q1-1.5 IQR, Q3+1.5 IQR].
3. Outer fences are located at a distance 3 IQR below Q1 and above Q3 [Q1-3 IQR, Q3+3 IQR].
4. A value between the inner and outer fences is a possible outlier. An extreme value beyond the outer fences is a probable outlier.

There is no statistical basis for the reason that Tukey uses 1.5 and 3 regarding the IQR to make inner and outer fences.

For the previous example data set, Q1=3.725, Q3=4.575, and IQR=0.85. Thus, the inner fence is [2.45, 5.85] and the outer fence is [1.18, 7.13]. Two extreme values, 14 and 15, are identified as probable outliers in this method. Figure 4 is a boxplot generated using the statistical software STATA for the example data set.

Figure 4: Boxplot for the Example Data Set
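The four rules above can be sketched in a few lines. The quartile definition is an assumption here: the "inclusive" method of `statistics.quantiles` (linear interpolation between order statistics) happens to reproduce Q1=3.725 and Q3=4.575 for the example data set, but other software may use a different definition. The function names are illustrative.

```python
import statistics

X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def tukey_fences(data):
    """Return Q1, Q3, IQR and the inner and outer fences."""
    # Assumption: quartiles by linear interpolation ("inclusive" method).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return q1, q3, iqr, inner, outer


def label_outliers(data, inner, outer):
    """Split observations into possible and probable outliers."""
    probable = [v for v in data if v < outer[0] or v > outer[1]]
    possible = [
        v for v in data
        if (outer[0] <= v < inner[0]) or (inner[1] < v <= outer[1])
    ]
    return possible, probable


q1, q3, iqr, inner, outer = tukey_fences(X)
possible, probable = label_outliers(X, inner, outer)
print(f"Q1={q1:.3f}, Q3={q3:.3f}, IQR={iqr:.3f}")
print("inner fences:", inner)
print("outer fences:", outer)
print("possible outliers:", possible)
print("probable outliers:", probable)
```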

While the previous methods are limited to mound-shaped and reasonably symmetric data such as the normal distribution [1], Tukey's method is applicable to skewed or non mound-shaped data since it makes no distributional assumptions and it does not depend on a mean or standard deviation. However, Tukey's method may not be appropriate for a small sample size [1]. For example, let's suppose that a data set consists of data points 145, 147, 9, 93, 418, 158, and 9. A simple distribution of the data using a Boxplot and Dotplot is shown in Figure 5. Although 158 and 9 may appear to be outliers in the dotplot, no observation is shown as an outlier in the boxplot.

Figure 5: Boxplot and Dotplot. (Note: No outlier shown in the boxplot)

3.5 ADJUSTED BOXPLOT

Although the boxplot proposed by Tukey (1977) may be applicable for both symmetric and skewed data, the more skewed the data, the more observations may be detected as outliers [3], as shown in Figure 2. This results from the fact that this method is based on robust measures such as lower and upper quartiles and the IQR without considering the skewness of the data. Vanderviere and Huber (2004) introduced an adjusted boxplot taking into account the medcouple (MC) [3], a robust measure of skewness for a skewed distribution.

When $X_n = \{x_1, x_2, \ldots, x_n\}$ is a data set independently sampled from a continuous univariate distribution and it is sorted such that $x_1 \le x_2 \le \ldots \le x_n$, the MC of the data is defined as

$$MC(x_1, \ldots, x_n) = \operatorname{med}\frac{(x_j - med_k) - (med_k - x_i)}{x_j - x_i},$$

where $med_k$ is the median of $X_n$, and i and j have to satisfy $x_i \le med_k \le x_j$ and $x_i \ne x_j$. The interval of the adjusted boxplot is as follows (G. Bray et al. (2005)):

$$[L, U] = \begin{cases} [\,Q1 - 1.5\exp(-3.5\,MC)\cdot IQR,\; Q3 + 1.5\exp(4\,MC)\cdot IQR\,] & \text{if } MC \ge 0 \\ [\,Q1 - 1.5\exp(-4\,MC)\cdot IQR,\; Q3 + 1.5\exp(3.5\,MC)\cdot IQR\,] & \text{if } MC \le 0, \end{cases}$$

where L is the lower fence, and U is the upper fence of the interval. The observations which fall outside the interval are considered outliers.

The value of the MC ranges between -1 and 1. If MC=0, the data is symmetric and the adjusted boxplot becomes Tukey's box plot. If MC>0, the data has a right skewed distribution, whereas if MC<0, the data has a left skewed distribution [3]. A simple example for computation of MC and a brief comparison of classical and MC skewness are in Appendix C.

For the previous example data set, Q1=3.725, Q3=4.575, IQR=0.85, and MC=0.43. Thus, the interval of the adjusted boxplot is [3.44, 11.6]. Two extreme values, 14 and 15, and the two smallest values, 3.2 and 3.4, are identified as outliers in this method. Figure 6 shows the change of the intervals of two boxplot methods, Tukey's method and the adjusted boxplot, for the example data set. The vertical dotted lines are the lower and upper bound of the interval of each method. Although the example data set is artificial and is not large enough to explain their difference, we can see a general trend that the interval of the adjusted boxplot, especially the upper fence, moves to the side of the skewed tail, compared to Tukey's method.
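The adjusted fences can be sketched directly from the interval formula. To keep the example short, the medcouple is not computed here; the rounded value MC=0.43 reported for the example data set is passed in as an input, so the upper fence may differ slightly from the reported interval. The function name `adjusted_fences` is illustrative.

```python
import math

X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def adjusted_fences(q1, q3, mc):
    """Lower and upper fences of the adjusted boxplot for a given medcouple."""
    iqr = q3 - q1
    if mc >= 0:
        lower = q1 - 1.5 * math.exp(-3.5 * mc) * iqr
        upper = q3 + 1.5 * math.exp(4 * mc) * iqr
    else:
        lower = q1 - 1.5 * math.exp(-4 * mc) * iqr
        upper = q3 + 1.5 * math.exp(3.5 * mc) * iqr
    return lower, upper


# Quartiles and medcouple as reported in the text (MC is rounded).
Q1, Q3, MC = 3.725, 4.575, 0.43

lower, upper = adjusted_fences(Q1, Q3, MC)
flagged = [v for v in X if v < lower or v > upper]
print(f"adjusted interval: [{lower:.2f}, {upper:.2f}]")
print("outliers:", flagged)

# With MC = 0 the interval reduces to Tukey's inner fences.
print("MC = 0:", adjusted_fences(Q1, Q3, 0.0))
```

For right-skewed data (MC>0) the lower fence sits closer to Q1 and the upper fence farther from Q3 than in Tukey's method, which is why the two smallest values are flagged here.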

Figure 6: Change of the Intervals of Two Different Boxplot Methods (Tukey's Method vs. the Adjusted Boxplot)
Inner fences of Tukey Method: (Q1-1.5*IQR, Q3+1.5*IQR)
Outer fences of Tukey Method: (Q1-3*IQR, Q3+3*IQR)
Single fence of adjusted box plot: (Q1-1.5 * exp(-3.5MC) * IQR, Q3+1.5 * exp(4MC) * IQR)

Vanderviere and Huber (2004) computed the average percentage of outliers beyond the lower and upper fence of two types of boxplots, the adjusted Boxplot and Tukey's Boxplot, for several distributions and different sample sizes. In the simulation, fewer observations, especially in the right tail, are classified as outliers compared to Tukey's method when the data are skewed to the right [3]. In the case of a mildly right-skewed distribution, the lower fence of the interval may move to the right and more observations on the left side will be classified as outliers compared to Tukey's method. This difference mainly comes from a decrease in the distance of the lower fence from Q1 and an increase in the distance of the upper fence from Q3 [3].

3.6 MADe METHOD

The MADe method, using the median and the Median Absolute Deviation (MAD), is one of the basic robust methods which are largely unaffected by the presence of extreme values in the data set [11]. This approach is similar to the SD method. However, the median and MADe are employed in this method instead of the mean and standard deviation. The MADe method is defined as follows:

2 MADe Method: Median ± 2 MADe
3 MADe Method: Median ± 3 MADe,

where MADe = 1.483 × MAD for large normal data.

MAD is an estimator of the spread in the data, similar to the standard deviation [11], but has an approximately 50% breakdown point like the median [1]. The notion of breakdown point is delineated in Appendix D.

$$MAD = \operatorname{median}\left(\,|x_i - \operatorname{median}(x)|\,\right), \quad i = 1, 2, \ldots, n$$

When the MAD value is scaled by a factor of 1.483, it is similar to the standard deviation in a normal distribution. This scaled MAD value is the MADe.

For the example data set, the median=4, MAD=0.3, and MADe=0.44. Thus, the intervals of the 2 MADe and 3 MADe methods are [3.11, 4.89] and [2.67, 5.33], respectively.

Since this approach uses two robust estimators having a high breakdown point, i.e., it is not unduly affected by extreme values even though a few observations make the distribution of the data skewed, the interval is seldom inflated, unlike the SD method.

3.7 MEDIAN RULE

The median is a robust estimator of location having an approximately 50% breakdown point. It is the value that falls exactly in the center of the data when the data are arranged in order. That is, if $x_1, x_2, \ldots, x_n$ is a random sample sorted by order of magnitude, then the median is defined as:

$$\tilde{x} = x_m \text{ when } n \text{ is odd}, \qquad \tilde{x} = \frac{x_m + x_{m+1}}{2} \text{ when } n \text{ is even}, \quad \text{where } m = \text{round up}(n/2).$$

For a skewed distribution like income data, the median is often used in describing the average of the data. The median and mean have the same value in a symmetrical distribution.

Carling (1998) introduces the median rule for identification of outliers through studying the relationship between target outlier percentage and Generalized Lambda Distributions (GLDs). GLDs with different parameters are used for various moderately skewed distributions [1]. The median substitutes for the quartiles of Tukey's method, and a different scale of the IQR is employed in this method. It is more resistant and its target outlier percentage is less affected by sample size than Tukey's method in the non-Gaussian case [1]. The scale of IQR can be adjusted depending on which target outlier percentage and GLD are selected. In this paper, 2.3 is chosen as the scale of IQR; when this scale is applied to a normal distribution, the outlier percentage turns out to be between Tukey's method of 1.5 IQR and that of 3 IQR, i.e., 0.2%. It is defined as:

$$[C_1, C_2] = Q2 \pm 2.3\,IQR, \quad \text{where } Q2 \text{ is the sample median.}$$

For the example data set, Q2=4, and IQR=0.85. Thus, the interval of this method is [2.05, 5.96].
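The MADe intervals and the median rule interval for the example data set can be sketched as follows. As in the Tukey example, the quartiles are assumed to be computed by linear interpolation (the "inclusive" method of `statistics.quantiles`), which gives IQR=0.85; the function names are illustrative.

```python
import statistics

X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def made_interval(data, k):
    """Interval median +/- k * MADe, with MADe = 1.483 * MAD."""
    med = statistics.median(data)
    mad = statistics.median([abs(v - med) for v in data])
    made = 1.483 * mad
    return med - k * made, med + k * made


def median_rule_interval(data, scale=2.3):
    """Interval Q2 +/- scale * IQR, with Q2 the sample median."""
    med = statistics.median(data)
    # Assumption: quartiles by linear interpolation ("inclusive" method).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return med - scale * iqr, med + scale * iqr


def outside(data, interval):
    """Observations falling outside the given (lower, upper) interval."""
    lower, upper = interval
    return [v for v in data if v < lower or v > upper]


made2 = made_interval(X, 2)
made3 = made_interval(X, 3)
med_rule = median_rule_interval(X)

print("2 MADe interval:", made2, "flags", outside(X, made2))
print("3 MADe interval:", made3, "flags", outside(X, made3))
print("median rule interval:", med_rule, "flags", outside(X, med_rule))
```

Both rules flag only 14 and 15 here, and neither interval is stretched by those two values, in contrast to the SD intervals of Section 3.1.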

4. SIMULATION STUDY AND RESULTS FOR THE FIVE SELECTED LABELING METHODS

Most intervals or criteria to identify possible outliers in outlier labeling methods are effective under the normal distribution. For example, in the case of well-known labeling methods such as the 2 SD and 3 SD methods and the Boxplot (1.5 IQR), the expected percentages of observations outside the interval are 5%, 0.3%, and 0.7%, respectively, under large normal samples. Although these methods are quite powerful with large normal data, it may be problematic to apply them to non-normal data or small sample sizes without information about their characteristics in these circumstances. This is because each labeling method has different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data.

The purpose of this simulation is to present the expected percentage of the observations outside of the interval of several labeling methods according to the sample size and the degree of the skewness of the data, using the lognormal distribution with the same mean and different variances. Through this simulation, we can learn not only the possible outlier percentage of several labeling methods but also which method is more robust with respect to the above two factors, skewness and sample size.

The simulation proceeds as follows:

Five labeling methods are selected: the SD Method, the MADe Method, Tukey's Method (Boxplot), Adjusted Boxplot, and the Median Rule. The Z-Score and modified Z-Score are not considered because their criteria to define an outlier are based on the normal distribution.

Average outlier percentages of the five labeling methods in the standard normal (0,1) and lognormal distributions with the same mean and different variances (mean=0, variance=0.2², 0.4², 0.6², 0.8², 1.0²) are computed. For each distribution, 1000 replications of sample sizes 20 and 50, 300 replications of the sample size 100, and 100 replications of the sample sizes 300 and 500 are considered.
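A simulation of this kind can be sketched in a scaled-down form. The sketch below is not the thesis's program (Appendix E): it covers four of the five methods and omits the adjusted boxplot because that method needs a medcouple routine, it assumes the multipliers 2 for the SD and MADe methods, 1.5 for Tukey's method and 2.3 for the median rule, and its sample size, replication count, and seed are arbitrary choices for illustration.

```python
import random
import statistics


def pct_outside(data, lower, upper):
    """Percentage of observations outside [lower, upper]."""
    return 100 * sum(v < lower or v > upper for v in data) / len(data)


def outlier_percentages(data):
    """Outlier percentage of one sample under four labeling methods."""
    mean, sd = statistics.mean(data), statistics.stdev(data)
    med = statistics.median(data)
    made = 1.483 * statistics.median([abs(v - med) for v in data])
    # Assumption: quartiles by linear interpolation ("inclusive" method).
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "2 SD": pct_outside(data, mean - 2 * sd, mean + 2 * sd),
        "2 MADe": pct_outside(data, med - 2 * made, med + 2 * made),
        "Tukey 1.5 IQR": pct_outside(data, q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "Median rule 2.3 IQR": pct_outside(data, med - 2.3 * iqr, med + 2.3 * iqr),
    }


def simulate(sampler, n, replications, rng):
    """Average the outlier percentages over repeated samples of size n."""
    totals = {}
    for _ in range(replications):
        sample = [sampler(rng) for _ in range(n)]
        for name, pct in outlier_percentages(sample).items():
            totals[name] = totals.get(name, 0.0) + pct
    return {name: total / replications for name, total in totals.items()}


rng = random.Random(0)  # arbitrary seed, only for reproducibility
normal_avg = simulate(lambda r: r.gauss(0.0, 1.0), n=100, replications=200, rng=rng)
lognormal_avg = simulate(
    lambda r: r.lognormvariate(0.0, 1.0), n=100, replications=200, rng=rng
)

print("standard normal:", normal_avg)
print("lognormal(0, 1): ", lognormal_avg)
```

Repeating the call to `simulate` over several log-scale standard deviations and sample sizes gives a table of average percentages with the same layout as the one described above, under the stated assumptions.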
To illustrate the shape of each distribution, i.e., the degree of skewness of the data, 5 random observations were generated from the distributions, and their density plots and skewness are as shown in Figure 7.

[Figure 7 panels: density plots (density value versus x) of the Standard Normal, Lognormal(0, 0.2), Lognormal(0, 0.4), Lognormal(0, 0.6), Lognormal(0, 0.8), and Lognormal(0, 1.0) distributions, each labeled with its cs and mc values.]

Figure 7: Standard Normal Distribution and Lognormal Distributions (cs = classical skewness, mc = medcouple skewness)

Figures 8 and 9 visually show the characteristics of the five labeling methods according to the sample size and skewness of the data using the lognormal distribution. All the values of the Figures, including the standard error of the average percentage, are reported in Table 5. The results of this simulation are as follows:

1. The MADe method classifies more observations as outliers than any other method. This method approaches the SD method in large normal data, since location and scale measures such as the median and MADe become the same as the mean and standard deviation of the SD method when data follow a normal distribution with a large sample size; however, as the data increase in skewness, the difference in outlier percentages between the MADe method and the SD method becomes larger. The MADe, Tukey's method, and the Median rule increase in the total average percentages of outliers the more skewed the data, while the SD method and adjusted boxplot seldom change over different sample sizes.

2. The Median rule classifies fewer observations than Tukey's 1.5 IQR method and more observations than Tukey's 3 IQR method.

3. The decrease range of the total outlier percentage of the adjusted boxplot is larger than that of other methods as the sample size increases.

4. Most methods except the adjusted boxplot show similar patterns in the average outlier percentages on the left side of the distribution. They decrease in left outlier percentage rapidly, especially the MADe and SD methods, the more skewed the data; however, the adjusted boxplot decreases slowly in sample sizes over 300. Different patterns of the adjusted boxplot, e.g., an increase in left outlier percentage in small sample sizes, may be due to the following: the left fence of the interval may move to the right side because of the MC skewness, and a few observations may be distributed outside the left fence by chance. Although the number of such observations is small, their ratio in a small sample could be large. This may increase the average percentage of outliers on the left of the distribution. The adjusted boxplot may still detect observations on the left side of the distribution in right-skewed data, especially mildly skewed data; however, the average percentages are quite low.

5. The MADe, Tukey's method, and the Median rule increase in the percentage of outliers on the right side of the distribution as the skewness of the data increases, while the SD method and adjusted boxplot seldom change in each sample size (the SD method increases slightly and plateaus). The right fences of the intervals of both methods, the SD method and adjusted boxplot, move to the right side of the distribution as the skewness of the data increases. Since the adjusted boxplot takes into account the skewness of the data, its right fence moves further toward the skewed tail, here the right side of the distribution, as the skewness increases. On the other hand, the interval of the SD method is just inflated because of the extreme values.
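The movement of the adjusted boxplot fences with skewness can be illustrated numerically. The sketch below assumes the fence formulas shown in the Table 5 column headings and in Appendix E, Q1 - 1.5 exp(-3.5 MC) IQR and Q3 + 1.5 exp(4 MC) IQR, which apply to right-skewed data (MC >= 0); the function names and the quartile values in the usage lines are hypothetical.

```python
import math


def tukey_fences(q1, q3, k=1.5):
    """Tukey's fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def adjusted_boxplot_fences(q1, q3, mc, k=1.5):
    """Adjusted boxplot fences for right-skewed data (assumes mc >= 0)."""
    iqr = q3 - q1
    lower = q1 - k * math.exp(-3.5 * mc) * iqr
    upper = q3 + k * math.exp(4.0 * mc) * iqr
    return lower, upper


# Hypothetical quartiles; MC = 0 reproduces Tukey's fences exactly.
print(tukey_fences(1.0, 3.0))
print(adjusted_boxplot_fences(1.0, 3.0, mc=0.0))
print(adjusted_boxplot_fences(1.0, 3.0, mc=0.3))
```

With a positive MC both fences shift to the right: the left fence moves closer to Q1 and the right fence moves further out, which matches the behavior described in results 4 and 5.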

[Figure 8: Change in the Outlier Percentages According to the Skewness of the Data (one panel per sample size)]

[Figure 9: Change in the Total Percentages of Outliers According to the Sample Size]

34 Table 5: The Average Percetage of Left Outliers, Right Outliers ad the Average Total Percet of Outliers for the Logormal Distributios with the Same Mea ad Differet Variaces (mea=, variace=.,.4,.6,.8, 1. ) ad the Stadard Normal Distributio with Differet Sample Sizes. Distributio SN LN (,.) LN (,.4) LN (,.6) CS.6 (.15) -.17 (.1) -.6 (.13).6 (.15) -.8 (.1).436 (.16).57 (.1).574 (.18).64 (.).69 (.15).864 (.) 1.6 (.17) (.7) 1.51 (.33) 1.33 (.5) 1.1 (.4) 1.63 (.4) (.39).1 (.63).199 (.64) MC -.4 (.7) -.9 (.5).1 (.6).4 (.6).4 (.5).84 (.7).86 (.5).79 (.6).93 (.6).94 (.4).161 (.7).17 (.5).181 (.7).167 (.6).17 (.5).19 (.7).5 (.5).51 (.6).54 (.7).55 (.5) Left (.8).176 (.53).6 (.66).67 (.73).66 (.51).555 (.5).71 (.37).73 (.5).676 (.44).594 (.35).95 (.).4 (.9). (.8).7 (.5). (.) () () () () () SD Method MADe Method Mea ± SD Mea ± 3 SD Media ± MADe Media ± 3 MADe Right Total Left Right Total Left Right Total Left Right Total (.83) (.11) (.1) (.11) (.16) (.15) (.156) (.41) (.66) (.73) (.19) (.5) (.63) (.13) (.1) (.17) (.95) (.88) (.141) (.3) (.7) (.45) (.6) (.79) (.17) (.) (.6) (.115) (.19) (.184) (.3) (.36) (.55) (.6) (.86) (.19) (.1) (.6) (.11) (.99) (.173) (.9) (.8) (.4) (.47) (.59) (.16) (.17) (.5) (.78) (.8) (.133) (.19) (.19) (.9) (.9) (.95) () (.31) (.31) (.11) (.183) (.7) (.34) (.18) (.119) (.55) (.59) () (.8) (.8) (.6) (.114) (.141) (.8) (.57) (.59) (.73) (.76) (.3) (.38) (.38) (.77) (.139) (.168) (.9) (.65) (.67) (.71) (.81) () (.35) (.35) (.6) (.16) (.185) () (.68) (.68) (.64) (.65) () (.9) (.9) (.4) (.116) (.13) () (.51) (.51) (.9) (.9) () (.55) (.55) (.91) (.197) (.5) (.3) (.141) (.144) (.55) (.54) () (.37) (.37) (.5) (.17) (.133) (.) 
(.84) (.85) (.73) (.73) () (.44) (.44) (.3) (.168) (.173) () (.11) (.11) (.65) (.66) () (.46) (.46) (.14) (.158) (.163) () (.94) (.94) (.56) (.57) () (.3) (.3) (.5) (.149) (.151) () (.74) (.74) (.84) (.84) () (.69) (.69) (.4) (.16) (.4) (.5) (.164) (.165) (.56) (.56) () (.38) (.38) (.11) (.14) (.14) () (.15) (.15) (.74) (.74) () (.5) (.5) (.1) (.17) (.171) () (.133) (.133) (.86) (.86) () (.51) (.51) () (.178) (.178) () (.146) (.146) (.68) (.68) () (.47) (.47) () (.145) (.145) () (.16) (.16) 7

35 8 Table 5 (cotiued) SD Method MADe Method Mea ± SD Mea ± 3 SD Media ± MADe Media ± 3 MADe Distributio CS MC Left Right Total Left Right Total Left Right Total Left Right Total 1.56 (.4).31 (.7).5 (.5) 5.71 (.81) (.81) () 1.86 (.76) 1.86 (.76).95 (.31) 13.4 (.18) (.4).5 (.5) (.191) 7.9 (.191) 5.16 (.3).315 (.5) () 5.5 (.58) 5.5 (.58) ().13 (.38).13 (.38).6 (.4) 1.93 (.143) (.144) () 7.33 (.1) 7.33 (.1) (.58).314 (.7) () (.84) (.84) ().177 (.49).177 (.49) () (.183) (.183) () 7.83 (.151) 7.83 (.151) (.15).37 (.7) () 4.5 (.13) 4.5 (.13) () (.53) (.53) () 1.75 (.193) 1.75 (.193) () 7.17 (.16) 7.17 (.16) LN (,.8) 5.98 (.96).34 (.5) () 4.48 (.76) 4.48 (.76) () 1.94 (.41) 1.94 (.41) () 1.84 (.13) 1.84 (.13) () 7.9 (.1) 7.9 (.1) (.6).353 (.7) () 6.3 (.78) 6.3 (.78) ().455 (.79).455 (.79).5 (.5) (.17) (.18). (.16) (.195) (.195) 5.66 (.34).384 (.5) () 5.48 (.6) 5.48 (.6) ().486 (.37).486 (.37) () (.15) (.15) () 1.14 (.133) 1.14 (.133) (.79).4 (.6) () 4.76 (.96) 4.76 (.96) ().73 (.51).73 (.51) () (.19) (.19) () (.165) (.165) (.183).399 (.7) () 4.7 (.18) 4.7 (.18) ().13 (.6).13 (.6) () 15.6 (.11) 15.6 (.11) () 9.67 (.187) 9.67 (.187) LN (, 1.) (.155).394 (.5) () 4.15 (.85) 4.15 (.85) () 1.99 (.46) 1.99 (.46) () (.137) (.137) () 9.73 (.16) 9.73 (.16) Tukey s Method Adjusted Boxplot Media Rule Q1-1.5 IQR / Q3+1.5 IQR Q1-3 IQR / Q3+3 IQR Q1-1.5exp(-3.5mc)/ Q3+1.5exp(4mc) Q ±.3 IQR Distributio CS MC Left Right Total Left Right Total Left Right Total Left Right Total.6 (.15) -.4 (.7) 1.1 (.83) 1.17 (.89).7 (.137).65 (.19).4 (.14).15 (.6).39 (.135).75 (.153) 5.14 (.178).615 (.61).685 (.67) 1.3 (.1) (.1) -.9 (.5).74 (.43).66 (.41) (.65).6 (.3). (.).8 (.4) 1.46 (.9).7 (.11) 3.53 (.11).9 (.8).36 (.4).58 (.39) (.13).1 (.6).537 (.49).51 (.49) 1.47 (.78).3 (.3).3 (.3).7 (.5) 1.5 (.11) 1.18 (.17).3 (.15).18 (.6).183 (.8).363 (.4) 3.6 (.15).4 (.6).4 (.45).363 (.38).783 (.61) () () ().59 (.77).647 (.93) 1.37 (.98).17 (.5).13 (.4).57 (.35) SN (.1).4 (.5).34 (.9).354 (.3).696 (.47) (). (.). 
(.).564 (.71).468 (.57).89 (1.17).11 (.16).11 (.16). (.3)

36 9 Table 5 (cotiued) Tukey s Method Adjusted Boxplot Media Rule Q1-1.5 IQR / Q3+1.5 IQR Q1-3 IQR / Q3+3 IQR Q1-1.5exp(-3.5mc)/ Q3+1.5exp(4mc) Q ±.3 IQR Distributio CS MC Left Right Total Left Right Total Left Right Total Left Right Total.436 (.16).84 (.7).415 (.5).9 (.113).75 (.137) ().1 (.33).1 (.33).75 (.146).395 (.143) 5.1 (.177).19 (.35) (.98) (.111) 5.57 (.1).86 (.5).146 (.) 1.86 (.67) 1.95 (.75) ().18 (.15).18 (.15) (.13) (.91) 3.41 (.118).8 (.8) (.5) 1.14 (.54) (.18).79 (.6).63 (.1) 1.6 (.76) (.78) ().63 (.15).63 (.15).95 (.11) 1.1 (.13).5 (.15).3 (.3).913 (.55).917 (.55) 3.64 (.).93 (.6). (.9) (.86) 1.67 (.86) ().77 (.16).77 (.16).8 (.13).543 (.68) (.97) ().94 (.64).94 (.64) LN (..) 5.69 (.15).94 (.4).1 (.6) 1.51 (.6) 1.54 (.6) ().36 (.9).36 (.9).47 (.7).356 (.39).88 (.66) ().838 (.44).838 (.44).864 (.).161 (.7).145 (.33) 3.85 (.131) 3.95 (.139) ().755 (.63).755 (.63).785 (.153).16 (.13) (.175).5 (.11).87 (.1).895 (.11) (.17).17 (.5).1 (.4) (.88) 3.56 (.88) ().56 (.34).56 (.34).38 (.11) 1.54 (.87) 3.54 (.119) ().538 (.76).538 (.76) (.7).181 (.7) () 3.3 (.15) 3.3 (.15) ().373 (.38).373 (.38) (.143).717 (.71).153 (.143) ().143 (.9).143 (.9) (.33).167 (.6) ().93 (.95).93 (.95) ().363 (.4).363 (.4).587 (.98).553 (.61) 1.14 (.98) ().47 (.83).47 (.83) LN (,.4) (.5).17 (.5) () 3.78 (.77) 3.78 (.77) ().41 (.3).41 (.3).4 (.59).514 (.48).916 (.65) ().54 (.67).54 (.67) 1.1 (.4).19 (.7).1 (.7) 5.5 (.151) 5.15 (.15) () 1.48 (.86) 1.48 (.86) 3.75 (.169) 1.94 (.117) 5.15 (.181) () (.139) (.139) (.4).5 (.5) () 4.8 (.95) 4.8 (.95) () 1.7 (.5) 1.7 (.5).178 (.13) (.69) 3.33 (.15) () 4.98 (.87) 4.98 (.87) (.39).51 (.6) () (.18) (.18) () 1.15 (.69) 1.15 (.69) (.134).767 (.77).163 (.136) () (.119) (.119) 3.1 (.63).54 (.7) () 4.81 (.13) 4.81 (.13) () (.61) (.61).633 (.18).593 (.66) 1.7 (.13) () 3.97 (.119) 3.97 (.119) LN (,.6) (.64).55 (.5) () 4.59 (.93) 4.59 (.93) () 1.7 (.48) 1.7 (.48).5 (.74).496 (.58) 1.16 (.78) () 3.7 (.8) 3.7 (.8) 1.56 (.4).31 (.7).1 (.1) (.16) 6.85 
(.163) ().595 (.113).595 (.113) 3.46 (.177) (.1) (.191) () 5.91 (.157) 5.91 (.157) 5.16 (.3).315 (.5) () (.1) (.1) ().8 (.68).8 (.68).134 (.1) 1.18 (.74) 3.35 (.17) () (.98) (.98) (.58).314 (.7) () (.131) (.131) ()..7 (.84)..7 (.84) 1.8 (.147).99 (.93).7 (.158) () 5.65 (.18) 5.65 (.18) (.15).37 (.7) () 6.67 (.137) 6.67 (.137) ().153 (.75).153 (.75).59 (.15).64 (.6) 1.3 (.119) () (.134) (.134) LN (,.8) 5.98 (.96).34 (.5) () (.113) (.113) () (.68) (.68).4 (.56).53 (.5).774 (.63) () (.13) (.13)

37 Table 5 (cotiued) Distributio CS MC LN (, 1.) (.6).66 (.34) 3.86 (.79) (.183) 4.5 (.155).353 (.7).384 (.5).4 (.6).399 (.7).394 (.5) Q1-1.5 IQR / Q3+1.5 IQR Left () () () () () Right 8.37 (.166) 8.16 (.11) (.144) 7.73 (.158) 7.68 (.1) Tukey s Method Adjusted Boxplot Media Rule Total 8.37 (.166) 8.16 (.11) (.144) 7.73 (.158) 7.68 (.1) Left () () () () () Q1-3 IQR / Q3+3 IQR Right 4.5 (.133) (.83) (.1) 3.3 (.11) (.75) Total 4.5 (.133) (.83) (.1) 3.3 (.11) (.75) Left (.179) (.11) 1. (.135).43 (.114).134 (.4) Q1-1.5exp(-3.5mc)/ Q3+1.5exp(4mc) Right.385 (.134) 1.41 (.76).847 (.74).687 (.6).616 (.5) Total 5.57 (.197) 3.38 (.17).47 (.138) 1.11 (.116).75 (.58) Left () () () () () Q ±.3 IQR Right (.163) (.17) 7.63 (.143) 7.4 (.148) (.116) Total (.163) (.17) 7.63 (.143) 7.4 (.148) (.116) (stadard error of the average percetage of outliers) 3

5. APPLICATION

In this chapter the five selected outlier labeling methods are applied to three real data sets and one modified version of one of the three real data sets. These real data sets are provided by Gateway Health Plan, a managed care alternative to the Department of Public Welfare's Medical Assistance Program in Pennsylvania. These data sets are part of the Primary Care Provider (PCP) basic information which is needed to identify providers (PCPs) associated with Member Dissatisfaction Rates (MDRs = the number of member complaints / PCP practice size) that are unusually high compared with other PCPs of similar sized practices [23].

Case 1 (data set 1) is visit per 1 office med, and its distribution is not very different from the normal distribution. Case 2 (data set 2) is Scripts per 1 Rx, and its distribution is mildly skewed to the right. Case 3 (data set 3) is Svcs per 1 early child im, and its distribution is highly skewed to the right because of one observation which has an extremely large value. Case 4 (data set 4) is modified from data set 3 by excluding its most extreme value, in order to see the possible effect of one extreme outlier on the outlier labeling methods. Figure 10 shows the basic statistics and distribution of each data set (Case 1-Case 4).

[Figure 10: Histogram and Basic Statistics of Case 1-Case 4. Each panel shows a density histogram with Min, 1st Qu., Mean, Median, 3rd Qu., Max, Total N, Variance, Std Dev., SE Mean, LCL Mean, UCL Mean, Skewness, Kurtosis, and Medcouple skewness.]

Table 6 shows the left, right, and total number of outliers identified in each data set after applying the five outlier labeling methods. Sample programs for Case 4 are given in APPENDIX E.

Table 6: Interval, Left, Right, and Total Number of Outliers According to the Five Outlier Methods
(Left, Right, Total: number of outliers, with percent of N in parentheses)

Case 1 (Data set 1): N=209
Method                      Interval            Left        Right        Total
2 SD Method                 (131.49, )          2 (0.96)    6 (2.87)     8 (3.83)
3 SD Method                 ( , )               0 (0)       1 (0.48)     1 (0.48)
Tukey's Method (1.5 IQR)    (376.81, )          1 (0.48)    2 (0.96)     3 (1.44)
Tukey's Method (3 IQR)      (-49.6, 17.3)       0 (0)       0 (0)        0 (0)
Adjusted Boxplot            (95.41, )           1 (0.48)    1 (0.48)     2 (0.96)
2 MADe Method               (131.5, 656.1)      4 (1.91)    11 (5.26)    15 (7.18)
3 MADe Method               (74.1, 749.5)       1 (0.48)    2 (0.96)     3 (1.44)
Median Rule                 (-43.87, )          0 (0)       1 (0.48)     1 (0.48)

Case 2 (Data set 2): N=209
Method                      Interval            Left        Right        Total
2 SD Method                 ( , )               0 (0)       8 (3.83)     8 (3.83)
3 SD Method                 ( , )               0 (0)       4 (1.91)     4 (1.91)
Tukey's Method (1.5 IQR)    (169.66, )          0 (0)       8 (3.83)     8 (3.83)
Tukey's Method (3 IQR)      ( , )               0 (0)       3 (1.44)     3 (1.44)
Adjusted Boxplot            (858.85, )          5 (2.39)    2 (0.96)     7 (3.35)
2 MADe Method               ( , )               4 (1.91)    20 (9.57)    24 (11.48)
3 MADe Method               ( , 488.4)          0 (0)       6 (2.87)     6 (2.87)
Median Rule                 ( , 4939.)          0 (0)       5 (2.39)     5 (2.39)

Case 3 (Data set 3): N=127
Method                      Interval            Left        Right        Total
2 SD Method                 ( , )               0 (0)       1 (0.79)     1 (0.79)
3 SD Method                 ( , 53.38)          0 (0)       1 (0.79)     1 (0.79)
Tukey's Method (1.5 IQR)    (-96.38, )          0 (0)       3 (2.36)     3 (2.36)
Tukey's Method (3 IQR)      ( , )               0 (0)       2 (1.57)     2 (1.57)
Adjusted Boxplot            ( , )               1 (0.79)    2 (1.57)     3 (2.36)
2 MADe Method               (114.7, )           1 (0.79)    6 (4.72)     7 (5.51)
3 MADe Method               ( , )               0 (0)       3 (2.36)     3 (2.36)
Median Rule                 ( , )               0 (0)       3 (2.36)     3 (2.36)

Table 6 (continued)

Case 4 (Data set 4): N=126
Method                      Interval            Left        Right        Total
2 SD Method                 ( , )               1 (0.79)    4 (3.17)     5 (3.97)
3 SD Method                 (-14.89, )          0 (0)       1 (0.79)     1 (0.79)
Tukey's Method (1.5 IQR)    (-64.67, )          0 (0)       2 (1.59)     2 (1.59)
Tukey's Method (3 IQR)      (-495.4, )          0 (0)       1 (0.79)     1 (0.79)
Adjusted Boxplot            (1375.9, )          1 (0.79)    1 (0.79)     2 (1.59)
2 MADe Method               ( , 96.99)          1 (0.79)    5 (3.97)     6 (4.76)
3 MADe Method               ( , )               0 (0)       2 (1.59)     2 (1.59)
Median Rule                 ( , )               0 (0)       2 (1.59)     2 (1.59)

Overall, the results of the applications show patterns similar to those in the simulation study. First, when data are skewed, the difference in the percentage of outliers between the SD method and the MADe method increases. Second, the MADe method classifies more observations as outliers than any other method does. Third, in the mildly right-skewed data set, Case 2, the adjusted boxplot labels more left outliers than right outliers. Finally, the interval of the Median rule is between Tukey's method with 1.5 IQR and Tukey's method with 3 IQR.

As shown in the results of Case 3 and Case 4, methods with robust measures, such as the MADe method, Tukey's method, the Median rule, and the Adjusted Boxplot, are less affected by the extreme value than the SD method, and the interval of the SD method narrows much more than those of other methods after the single extreme value is excluded from data set 3. With regard to the 2 SD method, while one observation is found in Case 3, five observations are detected as outliers in Case 4. That is, when there is a large gap between extreme values and the rest of the values, as in data set 3, outlier labeling methods based on the mean and standard deviation, such as the SD method and Z-Score, may not detect possible outliers which other methods could detect.

In the case of the two skewness measures, i.e., classical and medcouple skewness, classical skewness, unlike medcouple skewness, is highly affected by even a few extreme values. The classical skewness in data set 3 was 9.5, but it decreased to 1.8 in data set 4, from which the most extreme value of data set 3 was excluded, whereas the medcouple skewness decreased only a little.
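The effect of a single extreme value on the 2 SD interval, compared with the 2 MADe interval, can be reproduced on a toy data set. The data below are hypothetical and are not the Gateway Health Plan data; the function names and the MADe constant 1.4826 are assumptions of this sketch.

```python
import statistics


def sd_interval(values, k=2.0):
    """Mean -/+ k standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return mean - k * sd, mean + k * sd


def made_interval(values, k=2.0):
    """Median -/+ k MADe, with MADe = 1.4826 * median absolute deviation."""
    med = statistics.median(values)
    made = 1.4826 * statistics.median([abs(x - med) for x in values])
    return med - k * made, med + k * made


def count_outside(values, interval):
    """Number of observations falling outside the interval."""
    lower, upper = interval
    return sum(1 for x in values if x < lower or x > upper)


# Hypothetical data: a bulk of values, two moderately large ones, one extreme.
bulk = list(range(1, 21)) + [30, 35]
with_extreme = bulk + [1000]

print(count_outside(with_extreme, sd_interval(with_extreme)))    # SD, extreme kept
print(count_outside(bulk, sd_interval(bulk)))                    # SD, extreme removed
print(count_outside(with_extreme, made_interval(with_extreme)))  # MADe, extreme kept
```

In this toy example the extreme value inflates the SD interval so that the two moderately large values are hidden until the extreme value is removed, while the MADe interval flags all three at once.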

6. RECOMMENDATIONS

Figure 11 shows a decision-making flowchart as to which outlier labeling method can be used in different data situations. First, it is necessary to understand the data characteristics (explore data step). When a data set consists of subgroups such as sex and income, it may be necessary to check whether its research variables have different characteristics according to the subgroups. For example, in the case of detecting outliers in an adult height variable, it may be necessary to adjust for sex since the distribution of height can vary by sex. In such a case, an appropriate approach may be to stratify by sex.

All the labeling methods in this paper are applicable if a data set has a normal distribution without a possible masking problem or large gap between the majority of the data and extreme values. If the data set has a normal distribution with a possible masking problem or large gap between the majority of the data and extreme values, the Z-score and SD method may be inappropriate since these methods are highly sensitive to extreme values. Methods for data whose distribution is symmetric but not normal, e.g., a bimodal distribution, are beyond the purview of this study. Tukey's method, the MADe method, the Median rule, and the Adjusted Boxplot may be appropriate when a data set is skewed, such as in a lognormal distribution; however, among these four methods, only the Adjusted Boxplot explicitly takes into account the skewness of the data [32].

[Figure 11: Flowchart of Outlier Labeling Methods. Decision nodes: Start; Explore data; Symmetric Distribution (Yes/No); Normal Distribution (Yes/No); Masking Problem / Large Gap (Yes/No); Consideration for skewness (Yes/No). Terminal nodes list the candidate methods: Z-Score, Modified Z-Score, SD Method, Tukey's Method, MADe Method, Median Rule, Adjusted Boxplot, or Not applicable.]
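The recommendations in this chapter can be written down as a small selection function. The sketch below encodes only what the text of this chapter states; it is not a transcription of Figure 11, and the function name, argument names, and method labels are hypothetical.

```python
ALL_METHODS = [
    "Z-Score",
    "Modified Z-Score",
    "SD Method",
    "Tukey's Method",
    "MADe Method",
    "Median Rule",
    "Adjusted Boxplot",
]


def candidate_methods(symmetric, normal=False, masking_or_large_gap=False):
    """Return labeling methods suggested by the chapter's recommendations.

    An empty list means the situation is outside the scope of the study
    (symmetric but non-normal data, e.g. a bimodal distribution).
    """
    if symmetric:
        if not normal:
            return []
        if masking_or_large_gap:
            # Z-Score and SD Method are highly sensitive to extreme values.
            return [m for m in ALL_METHODS if m not in ("Z-Score", "SD Method")]
        return list(ALL_METHODS)
    # Skewed data: only the Adjusted Boxplot uses a skewness measure.
    return ["Tukey's Method", "MADe Method", "Median Rule", "Adjusted Boxplot"]


print(candidate_methods(symmetric=True, normal=True, masking_or_large_gap=True))
print(candidate_methods(symmetric=False))
```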

7. DISCUSSION AND CONCLUSIONS

As shown in the simulation study, each method has different measures to detect outliers and behaves differently according to the skewness and sample size of the data. The SD methods use less robust measures, such as the mean and standard deviation, which are highly affected by extreme values. Thus, their intervals tend to be inflated as the data increase in skewness, and consequently the average percentages of outliers change less than for other types of methods such as the MADe, Tukey's method, and the Median rule. These three methods show similar patterns in skewed data since they employ robust measures to build their intervals. The total average percentages of outliers for these methods increase when data are skewed.

Although the basic idea of the adjusted boxplot is similar to Tukey's method, it differs in that the adjusted boxplot takes a skewness measure into consideration. Thus, the total average percentage of outliers for the adjusted boxplot seldom changes, and even decreases very slightly, when data are skewed. In addition, the range of the percentages declines more rapidly than for other methods as the sample size increases. The total average percentage of outliers for this method, consequently, becomes smaller than for other methods as data become skewed and the sample size gets large.

The simulation results reported in Table 5 may not be an exact index of the outlier percentage for each method according to the skewness and the sample size of the data, since real-world data may not follow the same distributions employed in the simulation study, as was shown in Chapter 5. However, understanding the general features which the methods show would be helpful in choosing outlier labeling methods for normal or skewed data.

There can be a gap between the majority and a small fraction of the data in a skewed data set. In general, when the observations located in the small fraction apart from the majority of the data are considered target outliers, the likelihood of defining them as outliers can increase as the distance of the gap increases. However, if the gap is not large enough, detecting outliers may have different results depending on the method. In such a case, it may be hard to generalize how large the gap should be for each method to identify the observations in the small fraction as outliers, since data are diversely distributed.

Another way to detect outliers is the formal test based on specific distribution assumptions. Such a test defines the target outliers first, and then examines whether or not the outliers are true. Some formal tests may define all of the observations in the small fraction as outliers, whereas others may define only some of the last observations in the tail of the data distribution as outliers. Selection of formal tests mainly depends on the number and the type of target outliers and the type of data distribution [1]. In the future, formal tests in various distributions will be reviewed, compared, and discussed.

APPENDIX A

THE EXPECTATION, STANDARD DEVIATION AND SKEWNESS OF A LOGNORMAL DISTRIBUTION

Let X denote a random variable having a lognormal distribution; then its natural logarithm, Y = log(X), has a normal distribution. Aitchison and Brown (1957) note that when Y has mean value E(Y) = μ and variance Var(Y) = σ², the expected value and standard deviation of the original variable X are as follows:

$$E(X) = \exp\left(\mu + \frac{\sigma^2}{2}\right)$$

$$STDEV(X) = \sqrt{\exp(2\mu + 2\sigma^2) - \exp(2\mu + \sigma^2)}$$

It is usually denoted by X ~ LOGN(μ, σ²), i.e., X ~ LOGN(μ, σ²) if and only if Y = log(X) ~ N(μ, σ²). The skewness of X can be denoted as follows:

$$SKEW(X) = \left[\exp(\sigma^2) + 2\right]\sqrt{\exp(\sigma^2) - 1}$$

Simple example: If X is a lognormal random variable with parameters μ and σ², its natural logarithm, Y = log(X), follows N(μ, σ²). When μ = 0 and σ = 1 for Y, the corresponding mean, standard deviation, and skewness of X can be determined from the following:

$$E(X) = \exp\left(\mu + \frac{\sigma^2}{2}\right) = \exp(0 + 0.5) = 1.6487$$

$$STDEV(X) = \sqrt{\exp(2\mu + 2\sigma^2) - \exp(2\mu + \sigma^2)} = \sqrt{\exp(2) - \exp(1)} = 2.1612$$

$$SKEW(X) = \left[\exp(\sigma^2) + 2\right]\sqrt{\exp(\sigma^2) - 1} = \left[\exp(1) + 2\right]\sqrt{\exp(1) - 1} = 6.1849$$

We may compute the theoretical cutoff values of the SD method using this information. For example, when a certain variable, X, follows LOGN(0, 1), the theoretical lower and upper cutoff values of the 2 SD method for the variable are 1.6487 ± 2 × 2.1612.
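The three formulas of this appendix are easy to check numerically. The following sketch uses a hypothetical function name and takes the log-scale standard deviation `sigma` as its second argument.

```python
import math


def lognormal_summary(mu, sigma):
    """Mean, standard deviation, and skewness of X when log(X) ~ N(mu, sigma^2)."""
    var_log = sigma ** 2
    mean = math.exp(mu + var_log / 2.0)
    sd = math.sqrt(math.exp(2 * mu + 2 * var_log) - math.exp(2 * mu + var_log))
    skew = (math.exp(var_log) + 2.0) * math.sqrt(math.exp(var_log) - 1.0)
    return mean, sd, skew


mean, sd, skew = lognormal_summary(mu=0.0, sigma=1.0)
print(f"mean={mean:.4f} sd={sd:.4f} skewness={skew:.4f}")
# Theoretical 2 SD cutoffs for this distribution.
print(f"2 SD interval: ({mean - 2 * sd:.4f}, {mean + 2 * sd:.4f})")
```

Calling the function with other values of `sigma` gives the theoretical skewness that the sample classical skewness is estimating for each simulated lognormal distribution.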

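APPENDIX B

MAXIMUM Z SCORES

Shiffler (1988) showed that the maximum Z-Score depends on sample size. Let x_1, x_2, ..., x_{n-1}, x_n be an ordered random sample of size n from a population with unknown mean and variance, and let $\bar{x}_{n-1}$, the mean of the first n - 1 observations, be zero, so that $\bar{x} = x_n / n$. Writing $S_{n-1}^2$ for the sample variance of the first n - 1 observations, the sample variance of the whole sample is:

$$S_n^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n-1} x_i^2 + x_n^2 - \frac{x_n^2}{n}\right] = \frac{n-2}{n-1}S_{n-1}^2 + \frac{x_n^2}{n}$$

Now, the Z-Score of the sample is maximized when $S_n$ is minimized. Here, when $S_{n-1}$ is zero, $S_n$ has its smallest value, $S_n = x_n / \sqrt{n}$. That is, the maximum Z-Score can be presented as follows:

$$Z_{max} = \frac{x_n - \bar{x}}{S_n} = \frac{x_n - x_n / n}{x_n / \sqrt{n}} = \frac{n-1}{\sqrt{n}}$$

It shows that no matter how large x_n is, the maximum Z-Score of the sample depends on sample size. The smallest achievable value for the negative Z-Score is $-(n-1)/\sqrt{n}$ [28]. For several sample sizes, the maximum absolute Z-Score is as follows:

n    Z_max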
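The bound (n - 1)/sqrt(n) can be checked numerically. In this sketch the function names and the list of sample sizes are illustrative choices, not values taken from the thesis; the extreme sample puts n - 1 observations at zero and one observation far away, which is the configuration that attains the maximum.

```python
import math
import statistics


def max_z_score(n):
    """Largest Z-Score attainable in a sample of size n."""
    return (n - 1) / math.sqrt(n)


def largest_z(sample):
    """Largest Z-Score actually observed in a sample (sample SD, n - 1 divisor)."""
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)
    return max((x - mean) / sd for x in sample)


# Illustrative sample sizes only.
for n in (5, 10, 15, 20, 50, 100):
    extreme_sample = [0.0] * (n - 1) + [1e6]
    print(n, round(max_z_score(n), 3), round(largest_z(extreme_sample), 3))
```

Since the bound is below 3 for every n up to 10, a "Z greater than 3" rule cannot flag any observation in samples that small.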

APPENDIX C

CLASSICAL AND MEDCOUPLE (MC) SKEWNESS

Skewness is a measure of the asymmetry of a data distribution. Classical skewness, based on the third moment of the distribution, is commonly used. For any variable x, it is defined as

$$\text{Classical skewness} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^3}{(N-1)\,s^3},$$

where s is the sample standard deviation and N is the sample size. If the value of skewness is negative, the distribution of the data is skewed to the left, and if the value of skewness is positive, the distribution of the data is skewed to the right. Any symmetric data has a zero value of skewness.

Another type of skewness is the medcouple (MC), a robust alternative to classical skewness [1], introduced by Brys et al. (2003). When X_n = {x_1, x_2, ..., x_n} is a data set, independently sampled from a continuous univariate distribution, and it is sorted such that x_1 ≤ x_2 ≤ ... ≤ x_n, the MC of the data is defined as follows:

$$MC = \underset{x_i \le med_k \le x_j}{\mathrm{med}}\; h(x_i, x_j),$$

where the kernel function h is given by:

$$h(x_i, x_j) = \frac{(x_j - med_k) - (med_k - x_i)}{x_j - x_i},$$

where med_k is the median of X_n, and i and j have to satisfy x_i ≤ med_k ≤ x_j and x_i ≠ x_j. The value of the MC ranges between -1 and 1. If MC = 0, the data is symmetric. If MC > 0, the data has a right-skewed distribution, whereas if MC < 0, the data has a left-skewed distribution [3]. While classical skewness is highly affected by one or more extreme values of a data set since it is based on the third moment of the distribution, MC is robust to the extreme values [1].

Suppose that an example data set consists of 1, 2, 3, 4, 5, 6, 7, 10, 15, 16. The computation of the kernel function h(x_i, x_j) for the data set is as follows (median = 5.5):

x_i \ x_j       6        7       10       15       16
  1          -0.800   -0.500    0.000    0.357    0.400
  2          -0.750   -0.400    0.125    0.462    0.500
  3          -0.667   -0.250    0.286    0.583    0.615
  4          -0.500    0.000    0.500    0.727    0.750
  5           0.000    0.500    0.800    0.900    0.909

Thus, MC = median h(x_i, x_j) = 0.357. Several properties of the MC, including other types of robust skewness, are presented well in Brys et al. (2003, 2004).

Figure 12 shows that MC skewness is more robust than classical skewness as the sample size increases, especially in skewed data. The skewness shown is the average value over the replications in the previous simulation study. Classical skewness in skewed data increases and then becomes flat, while the MC seldom changes over different sample sizes, regardless of how skewed the data are. This is because more extreme values are generated from skewed distributions as the sample size gets large, and classical skewness is sensitive to the extreme values.

[Figure 12 panels: classical skewness versus sample size and medcouple skewness versus sample size, each with curves for SN, LN(0, 0.2), LN(0, 0.4), LN(0, 0.6), LN(0, 0.8), and LN(0, 1.0).]

Figure 12: Change of the Two Types of Skewness Coefficients According to the Sample Size and Data Distribution. (Note: These results came from the previous simulation. All the values are in Table 5.)
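The worked example of this appendix can be reproduced with a direct O(n²) implementation of the medcouple definition. This is a naive sketch under stated assumptions: the function name is hypothetical, pairs with x_i = x_j are simply skipped (the full medcouple definition uses a special kernel for observations tied with the median), and faster algorithms exist for large samples.

```python
import statistics


def medcouple_naive(values):
    """Median of the kernel h(x_i, x_j) over all pairs x_i <= med <= x_j, x_i != x_j."""
    med = statistics.median(values)
    lower = [x for x in values if x <= med]
    upper = [x for x in values if x >= med]
    kernel = [
        ((xj - med) - (med - xi)) / (xj - xi)
        for xi in lower
        for xj in upper
        if xi != xj
    ]
    return statistics.median(kernel)


example = [1, 2, 3, 4, 5, 6, 7, 10, 15, 16]
print(round(medcouple_naive(example), 3))
```

For the example data the 25 kernel values have median 5/14, which rounds to the 0.357 reported in the text.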

APPENDIX D

BREAKDOWN POINT

The notion of breakdown point was introduced by Hodges (1967) and Hampel (1968, 1971). It is a robustness measure of an estimator, such as the mean and median, or of a related procedure using the estimators. The breakdown point of an estimator can generally be defined as the largest percentage of the data that can be changed into arbitrary values without distorting the estimator [1]. For example, if even one observation of a univariate data set is moved to infinity, estimators such as the mean and variance go to infinity. Thus, the breakdown point of these estimators is zero. In contrast, the breakdown point of the median is approximately 50%, and it varies slightly according to whether the sample size is odd or even. The exact breakdown point of the median is 50(1 - 1/n)% and 50(1 - 2/n)% for odd and even sample sizes, respectively [1]. Therefore, if the breakdown point of an estimator is high, the estimator is robust.
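A short numerical check makes the contrast between the mean and the median concrete. The helper names below are hypothetical, and a very large finite number stands in for "moved to infinity".

```python
import statistics


def replace_largest(values, k, bad=1e12):
    """Replace the k largest observations with an arbitrarily large value."""
    kept = sorted(values)[: len(values) - k]
    return kept + [bad] * k


def median_breakdown_fraction(n):
    """Breakdown point of the median as a fraction: (1 - 1/n)/2 odd, (1 - 2/n)/2 even."""
    return 0.5 * (1 - 1 / n) if n % 2 else 0.5 * (1 - 2 / n)


base = list(range(1, 10))  # n = 9, median 5, mean 5
print(statistics.fmean(replace_largest(base, 1)))   # the mean is ruined by one value
print(statistics.median(replace_largest(base, 4)))  # the median survives 4 of 9
print(statistics.median(replace_largest(base, 5)))  # but not 5 of 9
print(median_breakdown_fraction(9), median_breakdown_fraction(10))
```

For n = 9 the median tolerates 4 corrupted observations, i.e. 4/9 of the data, which agrees with 50(1 - 1/9)%.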

APPENDIX E

PROGRAM CODE FOR OUTLIER LABELING METHODS

##SPLUS Professional
##Data set of Case 4 in Chapter 5, Application, is used.

##2 SD METHOD
#interval
sdl <- mean(case4) - 2*stdev(case4)
sdl
sdu <- mean(case4) + 2*stdev(case4)
sdu
#number of outliers
sdlrr <- ifelse(case4 < sdl, 1, 0)
sum(sdlrr)
sdurr <- ifelse(case4 > sdu, 1, 0)
sum(sdurr)

##3 SD METHOD
#interval
sd3l <- mean(case4) - 3*stdev(case4)
sd3l
sd3u <- mean(case4) + 3*stdev(case4)
sd3u
#number of outliers
sd3lrr <- ifelse(case4 < sd3l, 1, 0)
sum(sd3lrr)
sd3urr <- ifelse(case4 > sd3u, 1, 0)
sum(sd3urr)

##MADe
median(case4)
made <- 1.4826*(median(abs(median(case4) - case4)))

##2 MADe METHOD
#interval
madel <- median(case4) - 2*made
madel
madeu <- median(case4) + 2*made
madeu
#number of outliers
madelrr <- ifelse(case4 < madel, 1, 0)
sum(madelrr)
madeurr <- ifelse(case4 > madeu, 1, 0)
sum(madeurr)

##3 MADe METHOD
#interval
made3l <- median(case4) - 3*made
made3l
made3u <- median(case4) + 3*made
made3u
#number of outliers
made3lrr <- ifelse(case4 < made3l, 1, 0)
sum(made3lrr)
made3urr <- ifelse(case4 > made3u, 1, 0)
sum(made3urr)

#kernel values h(x_i, x_j) for the lower and upper halves of the sorted data (N = 126)
sortf <- sort(case4)
sortfi <- sortf[1:63]
sortfj <- sortf[64:126]
medk <- median(case4)
c <- matrix(0, 63, 63)
for (j in 1:63) {
  for (i in 1:63) {
    c[i,j] <- ((sortfj[j] - medk) - (medk - sortfi[i]))/(sortfj[j] - sortfi[i])
  }
}

##MC (medcouple skewness)
mc <- median(c, na.rm=T)

##CLASSICAL SKEWNESS
clasicskew <- mean((case4 - mean(case4))^3)/((mean((case4 - mean(case4))^2))^1.5)

q1 <- quantile(case4, 0.25)
q2 <- quantile(case4, 0.5)
q3 <- quantile(case4, 0.75)
iqr <- q3 - q1

##ADJUSTED BOXPLOT
#interval
adjl <- q1 - 1.5*exp(-3.5*mc)*iqr
adjl
adju <- q3 + 1.5*exp(4*mc)*iqr
adju
#number of outliers
adjlrr <- ifelse(case4 < adjl, 1, 0)
sum(adjlrr)
adjurr <- ifelse(case4 > adju, 1, 0)
sum(adjurr)

##TUKEY'S METHOD
#inner fence
tukey1.5l <- q1 - 1.5*iqr
tukey1.5l
tukey1.5u <- q3 + 1.5*iqr
tukey1.5u
#outer fence
tukey3l <- q1 - 3*iqr
tukey3l
tukey3u <- q3 + 3*iqr
tukey3u
#number of outliers (inner fence)
tukey1.5lrr <- ifelse(case4 < tukey1.5l, 1, 0)
sum(tukey1.5lrr)
tukey1.5urr <- ifelse(case4 > tukey1.5u, 1, 0)
sum(tukey1.5urr)
#number of outliers (outer fence)
tukey3lrr <- ifelse(case4 < tukey3l, 1, 0)
sum(tukey3lrr)
tukey3urr <- ifelse(case4 > tukey3u, 1, 0)
sum(tukey3urr)

##MEDIAN RULE
#interval
medianl <- q2 - 2.3*iqr
medianl
medianu <- q2 + 2.3*iqr
medianu
#number of outliers
medianlrr <- ifelse(case4 < medianl, 1, 0)
sum(medianlrr)
medianurr <- ifelse(case4 > medianu, 1, 0)
sum(medianurr)

BIBLIOGRAPHY

1. Acuna, E., Rodriguez, C. A meta analysis study of outlier detection methods in classification. Technical paper, Department of Mathematics, University of Puerto Rico at Mayaguez, 2004.
2. Aitchison, J., Brown, J.A.C. The lognormal distribution. Cambridge University Press, Cambridge, 1957.
3. Bain, L., Engelhardt, M. Introduction to probability and mathematical statistics. 2nd ed., Duxbury.
4. Barnett, V., Lewis, T. Outliers in statistical data. 3rd ed., Wiley.
5. Bendre, SM., Kale, BK. Masking effect on test for outliers in normal sample. Biometrika, Vol. 74, No. 4 (Dec., 1987).
6. Ben-Gal, I. Outlier detection. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, Kluwer Academic Publishers.
7. Brandt, R. Comparing classical and resistant outlier rules. Journal of the American Statistical Association, Vol. 85, No. 412 (Dec., 1990).
8. Brys, G., Hubert, M., Rousseeuw, P.J. A robustification of independent component analysis. Journal of Chemometrics, 2005.
9. Brys, G., Hubert, M., Struyf, A. A comparison of some new measures of skewness. Developments in Robust Statistics, ICORS 2001, eds. R. Dutter, P. Filzmoser, U. Gather and P.J. Rousseeuw, Springer-Verlag: Heidelberg, 2003.
10. Brys, G., Hubert, M., Struyf, A. A robust measure of skewness. Journal of Computational and Graphical Statistics 2004; 13.
11. Burke, S. Missing values, outliers, robust statistics & non-parametric methods. LC.GC Europe Online Supplement, statistics and data analysis.
12. Carling, K. Resistant outlier rules and the non-Gaussian case. Computational Statistics and Data Analysis, vol. 33, 2000.
13. Clark, J. Determining outliers. Hollins University, 2004, available at
14. Davies, L., Gather, U. The identification of multiple outliers. Journal of the American Statistical Association, Vol. 88, No. 423 (Sep., 1993).
15. Hampel, FR. A general qualitative definition of robustness. Annals of Mathematical Statistics, 42, 1971.
16. Hampel, FR. Contributions to the theory of robust estimation. Ph.D. Thesis, Dept. Statistics, Univ. California, Berkeley, 1968.
17. Hartwig, F., Dearing, B.E. Exploratory data analysis. Newbury Park, CA: Sage Publications, Inc.
18. Harvey, M. Prism statistics guide, version 4.0.
19. High, R. Dealing with outliers: How to maintain your data's integrity. University of Oregon, 2000, available at
20. Hoaglin, D., Tukey, JW. Performance of some resistant rules for outlier labeling. Journal of the American Statistical Association, Vol. 81, No. 396 (Dec., 1986).
21. Iglewicz, B., Hoaglin, D. How to detect and handle outliers. ASQC Quality Press.
22. Lethe, J. Chebychev and empirical rules. Texas A&M University, 1996, available at
23. Marsh, GM. Standard protocol for outlier analysis of dissatisfaction rate. Technical Report, University of Pittsburgh.
24. Meyer, RK., Krueger, D. A Minitab guide to statistics. 2nd ed., Prentice Hall.
25. Olsson, U. Confidence interval for the mean of a lognormal distribution. Journal of Statistics Education, Vol. 13, No. 1 (2005).
26. Osborne, JW., Overbay, A. The power of outliers (and why researchers should always check for them). Practical Assessment, Research & Evaluation, 9(6).
27. Rosner, B. Fundamentals of biostatistics. 4th ed., Pacific Grove (CA), Duxbury.
28. Shiffler, RE. Maximum Z score and outliers. The American Statistician, Vol. 42, No. 1 (Feb., 1988).
29. Siegel, A. Statistics and data analysis: An introduction. Wiley, New York.
30. Stephenson, D. Environmental statistics for climate researcher. University of Reading, U.K., 2004.
31. Tukey, JW. Exploratory data analysis. Addison-Wesley.
32. Vandervieren, E., Hubert, M. An adjusted boxplot for skewed distributions. Compstat 2004 graphics.
33. Zhou, X., Gao, S. Confidence intervals for the lognormal mean. Statistics in Medicine, Vol. 16.