A Review and Comparison of Methods for Detecting Outliers in Univariate Data Sets


by Songwon Seo

BS, Kyunghee University

Submitted to the Graduate Faculty of the Graduate School of Public Health in partial fulfillment of the requirements for the degree of Master of Science

University of Pittsburgh, 2006

UNIVERSITY OF PITTSBURGH
Graduate School of Public Health

This thesis was presented by Songwon Seo. It was defended on April 6, 2006 and approved by:

Laura Cassidy, PhD, Assistant Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

Ravi K. Sharma, PhD, Assistant Professor, Department of Behavioral and Community Health Sciences, Graduate School of Public Health, University of Pittsburgh

Thesis Director: Gary M. Marsh, PhD, Professor, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh

Gary M. Marsh, PhD

Songwon Seo, M.S.

University of Pittsburgh, 2006

Most real-world data sets contain outliers that have unusually large or small values when compared with others in the data set. Outliers may cause a negative effect on data analyses, such as ANOVA and regression, based on distribution assumptions, or may provide useful information about data when we look into an unusual response to a given study. Thus, outlier detection is an important part of data analysis in the above two cases. Several outlier labeling methods have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. Although these methods are quite powerful with large normal data, it may be problematic to apply them to nonnormal data or small sample sizes without knowledge of their characteristics in these circumstances. This is because each labeling method has different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data.

Many kinds of data regarding public health are often skewed, usually to the right, and lognormal distributions can often be applied to such skewed data, for instance, surgical procedure times, blood pressure, and assessment of toxic compounds in environmental analysis. This paper reviews and compares several common and less common outlier labeling methods and presents information that shows how the percent of outliers changes in each method according to the skewness and sample size of lognormal distributions through simulations and application to real data sets. These results may help establish guidelines for the choice of outlier detection methods in skewed data, which are often seen in the public health field.

TABLE OF CONTENTS

1. INTRODUCTION
1.1 BACKGROUND
1.2 OUTLIER DETECTION METHOD
2. STATEMENT OF PROBLEM
3. OUTLIER LABELING METHOD
3.1 STANDARD DEVIATION (SD) METHOD
3.2 Z-SCORE
3.3 THE MODIFIED Z-SCORE
3.4 TUKEY'S METHOD (BOXPLOT)
3.5 ADJUSTED BOXPLOT
3.6 MADe METHOD
3.7 MEDIAN RULE
4. SIMULATION STUDY AND RESULTS FOR THE FIVE SELECTED LABELING METHODS
5. APPLICATION
6. RECOMMENDATIONS
7. DISCUSSION AND CONCLUSIONS
APPENDIX A: THE EXPECTATION, STANDARD DEVIATION AND SKEWNESS OF A LOGNORMAL DISTRIBUTION
APPENDIX B: MAXIMUM Z SCORE
APPENDIX C: CLASSICAL AND MEDCOUPLE (MC) SKEWNESS
APPENDIX D: BREAKDOWN POINT
APPENDIX E: PROGRAM CODE FOR OUTLIER LABELING METHODS
BIBLIOGRAPHY

LIST OF TABLES

Table 1: Basic Statistics of a Simple Data Set
Table 2: Basic Statistics After Changing 7 into 77 in the Simple Data Set
Table 3: Computation and Masking Problem of the Z-Score
Table 4: Computation of Modified Z-Score and its Comparison with the Z-Score
Table 5: The Average Percentage of Left Outliers, Right Outliers and the Average Total Percent of Outliers for the Lognormal Distributions with the Same Mean and Different Variances (mean = 0, variance = 0.2², 0.4², 0.6², 0.8², 1.0²) and the Standard Normal Distribution with Different Sample Sizes
Table 6: Interval, Left, Right, and Total Number of Outliers According to the Five Outlier Methods

LIST OF FIGURES

Figure 1: Probability density function for a normal distribution according to the standard deviation
Figure 2: Theoretical Change of Outliers Percentage According to the Skewness of the Lognormal Distributions in the SD Method and Tukey's Method
Figure 3: Density Plot and Dotplot of the Lognormal Distribution (sample size = 50) with Mean = 1 and SD = 1, and its Logarithm, Y = log(x)
Figure 4: Boxplot for the Example Data Set
Figure 5: Boxplot and Dotplot. (Note: No outlier shown in the boxplot)
Figure 6: Change of the Intervals of Two Different Boxplot Methods
Figure 7: Standard Normal Distribution and Lognormal Distributions
Figure 8: Change in the Outlier Percentages According to the Skewness of the Data
Figure 9: Change in the Total Percentages of Outliers According to the Sample Size
Figure 10: Histogram and Basic Statistics of Case 1-Case
Figure 11: Flowchart of Outlier Labeling Methods
Figure 12: Change of the Two Types of Skewness Coefficients According to the Sample Size and Data Distribution. (Note: These results came from the previous simulation. All the values are in Table 5.)

1. INTRODUCTION

This chapter consists of two sections: the Background and the Outlier Detection Method. The Background discusses basic ideas about outliers, such as definitions, features, and reasons to detect them. The Outlier Detection Method section briefly describes the characteristics of the two kinds of outlier detection methods: formal and informal tests.

1.1 BACKGROUND

Observed variables often contain outliers that have unusually large or small values when compared with others in a data set. Some data sets may come from homogeneous groups; others from heterogeneous groups that have different characteristics regarding a specific variable, such as height data not stratified by gender. Outliers can be caused by incorrect measurements, including data entry errors, or by coming from a different population than the rest of the data. If the measurement is correct, it represents a rare event. Two aspects of an outlier can be considered.

The first aspect to note is that outliers cause a negative effect on data analysis. Osborne and Overbay (2004) briefly categorized the deleterious effects of outliers on statistical analyses:

1) Outliers generally serve to increase error variance and reduce the power of statistical tests.
2) If non-randomly distributed, they can decrease normality (and in multivariate analyses, violate assumptions of sphericity and multivariate normality), altering the odds of making both Type I and Type II errors.
3) They can seriously bias or influence estimates that may be of substantive interest.

The following example shows how one outlier can highly distort the mean, variance, and 95% confidence interval for the mean. Suppose there is a simple data set composed of the data points 1, 2, 3, 4, 5, 6, 7, whose basic statistics are shown in Table 1. Now, let's replace data point 7 with 77. As shown in Table 2, the mean and variance of the data are much larger than those of the original data set due to one unusual data value, 77. The 95% confidence interval for the mean is also much broader because of the large variance. This may cause potential problems when a data analysis that is sensitive to the mean or variance is conducted.

Table 1: Basic Statistics of a Simple Data Set
Mean | Median | Variance | 95% Confidence Interval for the mean
4.00 | 4.00 | 4.67 | [2.00 to 6.00]

Table 2: Basic Statistics After Changing 7 into 77 in the Simple Data Set
Mean | Median | Variance | 95% Confidence Interval for the mean
14.00 | 4.00 | 774.67 | [-11.74 to 39.74]

The second aspect of outliers is that they can provide useful information about data when we look into an unusual response to a given study. They could be the extreme values sitting apart from the majority of the data regardless of distribution assumptions. The following two cases are good examples of outlier analysis in terms of this second aspect: 1) to identify medical practitioners who under- or over-utilize specific procedures or medical equipment, such as an x-ray instrument; 2) to identify Primary Care Physicians (PCPs) with inordinately high Member Dissatisfaction Rates (MDRs) (MDRs = the number of member complaints / PCP practice size) compared to other PCPs.

In summary, there are two reasons for detecting outliers. The first reason is to find outliers which influence assumptions of a statistical test, for example, outliers violating the normal distribution assumption in an ANOVA test, and deal with them properly in order to improve statistical analysis. This could be considered a preliminary step for data analysis. The second reason is to use the outliers themselves for the purpose of obtaining certain critical information about the data, as was shown in the above examples.
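The effect described above can be checked with a few lines of Python. This is only a sketch using the standard library; the function name `summarize` and the hard-coded t quantile (2.4469 for 6 degrees of freedom, taken from a standard t-table) are assumptions made for illustration and are not part of the thesis.

```python
import math
import statistics

# 97.5% Student-t quantile for 6 degrees of freedom (n = 7), from a t-table.
# Hard-coded (assumption) so that only the standard library is needed.
T_975_DF6 = 2.4469


def summarize(data, t_crit=T_975_DF6):
    """Return mean, median, sample variance and a 95% CI for the mean."""
    n = len(data)
    mean = statistics.mean(data)
    variance = statistics.variance(data)  # sample variance, divisor n - 1
    half_width = t_crit * math.sqrt(variance / n)
    return {
        "mean": mean,
        "median": statistics.median(data),
        "variance": variance,
        "ci": (mean - half_width, mean + half_width),
    }


original = [1, 2, 3, 4, 5, 6, 7]
contaminated = [1, 2, 3, 4, 5, 6, 77]  # 7 replaced by 77

for label, data in (("original", original), ("7 -> 77", contaminated)):
    s = summarize(data)
    print(f"{label}: mean={s['mean']:.2f} median={s['median']:.2f} "
          f"variance={s['variance']:.2f} "
          f"CI=[{s['ci'][0]:.2f}, {s['ci'][1]:.2f}]")
```

The median is the only one of the four statistics that does not move when 7 is replaced by 77, which is the motivation for the robust methods reviewed later.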

1.2 OUTLIER DETECTION METHOD

There are two kinds of outlier detection methods: formal tests and informal tests. Formal and informal tests are usually called tests of discordancy and outlier labeling methods, respectively.

Most formal tests need test statistics for hypothesis testing. They are usually based on assuming some well-behaving distribution, and test whether the target extreme value is an outlier of the distribution, i.e., whether or not it deviates from the assumed distribution. Some tests are for a single outlier and others for multiple outliers. Selection of these tests mainly depends on the number and type of target outliers, and the type of data distribution. Many tests, organized by the choice of distribution, are discussed in Barnett and Lewis (1994) and Iglewicz and Hoaglin (1993). Iglewicz and Hoaglin (1993) reviewed and compared, through simulations, five selected formal tests which are applicable to the normal distribution: the Generalized ESD, Kurtosis statistics, Shapiro-Wilk, the Boxplot rule, and the Dixon test.

Even though formal tests are quite powerful under well-behaving statistical assumptions such as a distribution assumption, most distributions of real-world data may be unknown or may not follow specific distributions such as the normal, gamma, or exponential. Another limitation is that they are susceptible to masking or swamping problems. Acuna and Rodriguez (2004) define these problems as follows:

Masking effect: It is said that one outlier masks a second outlier if the second outlier can be considered as an outlier only by itself, but not in the presence of the first outlier. Thus, after the deletion of the first outlier the second instance emerges as an outlier.

Swamping effect: It is said that one outlier swamps a second observation if the latter can be considered as an outlier only under the presence of the first one. In other words, after the deletion of the first outlier the second observation becomes a non-outlying observation.
Many studies regarding these problems have been conducted by Barnett and Lewis (1994), Iglewicz and Hoaglin (1993), Davies and Gather (1993), and Bendre and Kale (1987).

On the other hand, most outlier labeling methods (informal tests) generate an interval or criterion for outlier detection instead of hypothesis testing, and any observation beyond the interval or criterion is considered an outlier. Various location and scale parameters are employed in each labeling method to define a reasonable interval or criterion for outlier detection. There are two reasons for using an outlier labeling method. One is to find possible outliers as a screening device before conducting a formal test. The other is to find the extreme values away from the majority of the data regardless of the distribution.

While the formal tests usually require test statistics based on the distribution assumptions and a hypothesis to determine if the target extreme value is a true outlier of the distribution, most outlier labeling methods present the interval using the location and scale parameters of the data. Although a labeling method is usually simple to use, some observations outside the interval may turn out to be falsely identified outliers after a formal test, when outliers are defined only as observations that deviate from the assumed distribution. However, if the purpose of the outlier detection is not a preliminary step to find the extreme values violating the distribution assumptions of the main statistical analyses such as the t-test, ANOVA, and regression, but mainly to find the extreme values away from the majority of the data regardless of the distribution, the outlier labeling methods may be applicable. In addition, for a large data set that is statistically problematic, e.g., when it is difficult to identify the distribution of the data or transform it into a proper distribution such as the normal distribution, labeling methods can be used to detect outliers.

This paper focuses on outlier labeling methods. Chapter 2 presents the possible problems when labeling methods are applied to skewed data. In Chapter 3, seven outlier labeling methods are outlined. In Chapter 4, the average percentages of outliers in the standard normal and lognormal distributions with the same mean and different variances are computed to compare the outlier percentages of the five selected outlier labeling methods according to the degree of skewness and different sample sizes. In Chapter 5, the five selected methods are applied to real data sets.

2. STATEMENT OF PROBLEM

Outlier-labeling methods such as the Standard Deviation (SD) method and the boxplot are commonly used and are easy to use. These methods are quite reasonable when the data distribution is symmetric and mound-shaped, such as the normal distribution. Figure 1 shows that about 68%, 95%, and 99.7% of the data from a normal distribution are within 1, 2, and 3 standard deviations of the mean, respectively. If data follow a normal distribution, this helps to estimate the likelihood of having extreme values in the data, so that an observation two or three standard deviations away from the mean may be considered an outlier in the data.

Figure 1: Probability density function for a normal distribution according to the standard deviation.

The boxplot, which was developed by Tukey (1977), is another very helpful method since it makes no distributional assumptions, nor does it depend on a mean or standard deviation [19]. The lower quartile (q1) is the 25th percentile, and the upper quartile (q3) is the 75th percentile of the data. The inter-quartile range (IQR) is defined as the interval between q1 and q3.

Tukey (1977) defined q1-(1.5*iqr) and q3+(1.5*iqr) as "inner fences", q1-(3*iqr) and q3+(3*iqr) as "outer fences", the observations between an inner fence and its nearby outer fence as "outside", and anything beyond the outer fences as "far out" [31]. High (2000) renamed the "outside" observations potential outliers and the "far out" observations problematic outliers [19]. The "outside" and "far out" observations can also be called possible outliers and probable outliers, respectively. This method is quite effective, especially when working with large continuous data sets that are not highly skewed [19].

Although Tukey's method is quite effective when working with large data sets that are fairly normally distributed, many distributions of real-world data do not follow a normal distribution. They are often highly skewed, usually to the right, and in such cases the distributions are frequently closer to a lognormal distribution than a normal one. The lognormal distribution can often be applied to such data in a variety of forms, for instance, personal income, blood pressure, and assessment of toxic compounds in environmental analysis.

In order to illustrate how the theoretical percentage of outliers changes according to the skewness of the data in the SD method (Mean ± 2 SD, Mean ± 3 SD) and Tukey's method, lognormal distributions with the same mean (0) but different standard deviations (0.2, 0.4, 0.6, 0.8, 1.0, 1.2) are used for the data sets with different degrees of skewness, and the standard normal distribution is used for the data set whose skewness is zero. The computation of the mean, standard deviation, and skewness of a lognormal distribution is given in Appendix A. According to Figure 2, the two methods show different patterns: the outlier percentage of Tukey's method increases with skewness, unlike that of the SD method. This shows that the results of outlier detection may change depending on the outlier detection method or the distribution of the data.
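The theoretical comparison described above can be approximated numerically. The sketch below assumes that the "mean" and "standard deviation" quoted for each lognormal distribution are the parameters of log X (log X ~ N(0, σ²)), and it uses the standard closed forms for the lognormal mean, standard deviation, skewness, and quantiles. The function names are illustrative only; this is not the thesis's own program from Appendix E.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for its cdf and quantiles


def lognormal_cdf(x, sigma):
    """CDF of X when log X ~ N(0, sigma^2); zero for non-positive x."""
    return STD_NORMAL.cdf(math.log(x) / sigma) if x > 0 else 0.0


def pct_outside(lower, upper, sigma):
    """Percentage of the lognormal distribution outside [lower, upper]."""
    return 100 * (lognormal_cdf(lower, sigma) + 1 - lognormal_cdf(upper, sigma))


def sd_method_pct(sigma, k):
    """Theoretical outlier percentage for the interval Mean +/- k SD."""
    mean = math.exp(sigma ** 2 / 2)
    sd = math.sqrt((math.exp(sigma ** 2) - 1) * math.exp(sigma ** 2))
    return pct_outside(mean - k * sd, mean + k * sd, sigma)


def tukey_pct(sigma, k):
    """Theoretical outlier percentage for the fences Q1 - k IQR, Q3 + k IQR."""
    z75 = STD_NORMAL.inv_cdf(0.75)
    q1, q3 = math.exp(-sigma * z75), math.exp(sigma * z75)
    iqr = q3 - q1
    return pct_outside(q1 - k * iqr, q3 + k * iqr, sigma)


def skewness(sigma):
    """Classical skewness of the lognormal distribution."""
    w = math.exp(sigma ** 2)
    return (w + 2) * math.sqrt(w - 1)


for sigma in (0.2, 0.4, 0.6, 0.8, 1.0, 1.2):
    print(f"sigma={sigma:.1f} skewness={skewness(sigma):6.2f} "
          f"2SD={sd_method_pct(sigma, 2):5.2f}% 3SD={sd_method_pct(sigma, 3):5.2f}% "
          f"Tukey1.5={tukey_pct(sigma, 1.5):5.2f}% Tukey3={tukey_pct(sigma, 3):5.2f}%")
```

Under these assumptions the Tukey percentages rise steadily with σ, while the 2 SD percentage stays within a narrow band, which is the qualitative pattern the text attributes to Figure 2.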

Figure 2: Theoretical Change of Outliers Percentage According to the Skewness of the Lognormal Distributions in the SD Method and Tukey's Method
(Plot of outlier percentage against skewness for four series: 2 SD Method (Mean ± 2 SD), 3 SD Method (Mean ± 3 SD), Tukey's Method (1.5 IQR), and Tukey's Method (3 IQR).)

When data are highly skewed or in other respects depart from a normal distribution, a transformation to normality is a common step in order to identify outliers using a method which is quite effective for a normal distribution. Such a transformation can be useful when the identification of outliers is conducted as a preliminary step for data analysis, and it also helps to make possible the selection of appropriate statistical procedures for estimation and testing. However, if an outlier itself is a primary concern in a given study, as in the previous example of identifying medical practitioners who under- or over-utilize medical equipment such as x-ray instruments, a transformation of the data could affect our ability to identify outliers.

For example, 50 random samples (x) are generated with the statistical software R in order to show the effect of the transformation. The random variable X has a lognormal distribution (Mean=1, SD=1), and its logarithm, Y=log(x), has a normal distribution. If the observations which are beyond the mean by two standard deviations are considered outliers, the expected outliers before and after transformation are totally different. As shown in Figure 3, while three observations which have large values are considered outliers in the original 50 random samples (x), after log transformation of these samples two observations with small values appear to be outliers, and the formerly large-valued observations are no longer considered outliers. The vertical lines in each graph represent cutoff values (Mean ± 2*SD). Lower and upper cutoff values are ( , ) and ( , .76336), respectively, in the lognormal data (x) and its logarithm (y). Although this approach is not affected by extreme values, because it does not depend on the extreme observations after transformation, an artificial transformation may reshape the data so that true outliers are not detected or other observations are falsely identified as outliers.

Figure 3: Density Plot and Dotplot of the Lognormal Distribution (sample size=50) with Mean=1 and SD=1, and its Logarithm, Y=log(x).
(Two panels: the density dlnorm(x, 1, 1) against x, and the density dnorm(y, 1, 1) against y, each with a dotplot of the sample.)

Several methods to identify outliers have been developed. Some methods are sensitive to extreme values, like the SD method, and others are resistant to extreme values, like Tukey's method. The objective of this paper is to review and compare several common and less common labeling methods for identifying outliers and to present information that shows how the average percentage of outliers changes in each method according to the degree of skewness and sample size of the data, in order to help establish guidelines for the choice of outlier detection methods in skewed data when an outlier itself is a primary concern in a given study.
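The effect of the log transformation discussed in this chapter can be reproduced in a small, deterministic sketch. Instead of 50 random draws (which the thesis generated in R), the example below uses 50 evenly spaced quantiles of N(1, 1) as an idealised "sample" so that the result does not depend on a random seed; that substitution, and the helper name `two_sd_flags`, are assumptions made purely for illustration.

```python
import math
import statistics
from statistics import NormalDist


def two_sd_flags(values):
    """Indices of observations lying outside mean +/- 2 sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return {i for i, v in enumerate(values) if abs(v - mean) > 2 * sd}


n = 50
# Idealised sample: quantiles of N(1, 1) at probabilities (i + 0.5) / n,
# sorted ascending, standing in for random draws (assumption).
y = [NormalDist(mu=1, sigma=1).inv_cdf((i + 0.5) / n) for i in range(n)]
x = [math.exp(v) for v in y]  # the corresponding lognormal values

flags_x = two_sd_flags(x)  # labeled on the original (lognormal) scale
flags_y = two_sd_flags(y)  # labeled after the log transformation

print("flagged on the original scale:", sorted(flags_x))
print("flagged on the log scale:     ", sorted(flags_y))
```

On the original scale only large values can be flagged, because the lower cutoff falls below zero; after the transformation the smallest observation is flagged as well, and some large values are no longer flagged. The labeled set therefore depends on the scale on which the rule is applied.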

3. OUTLIER LABELING METHOD

This chapter reviews seven outlier labeling methods and gives examples of simple numerical computations for each test.

3.1 STANDARD DEVIATION (SD) METHOD

The simple classical approach to screen outliers is to use the SD (Standard Deviation) method. It is defined as

2 SD Method: $\bar{x} \pm 2\,SD$
3 SD Method: $\bar{x} \pm 3\,SD$,

where $\bar{x}$ is the sample mean and SD is the sample standard deviation. The observations outside these intervals may be considered outliers. According to the Chebyshev inequality, if a random variable X with mean μ and variance σ² exists, then for any k > 0,

$$P[\,|X - \mu| \ge k\sigma\,] \le \frac{1}{k^2}, \qquad P[\,|X - \mu| < k\sigma\,] \ge 1 - \frac{1}{k^2}, \quad k > 0.$$

The bound $1 - (1/k)^2$ enables us to determine what proportion of our data will be within k standard deviations of the mean. For example, at least 75%, 89%, and 94% of the data are within 2, 3, and 4 standard deviations of the mean, respectively. These results may help us determine the likelihood of having extreme values in the data. Although Chebyshev's theorem is true for any data from any distribution, it is limited in that it only gives the smallest proportion of observations within k standard deviations of the mean. When the distribution of a random variable is known, a more exact proportion of observations centering around the mean can be computed. For instance, if certain data follow a normal distribution, approximately 68%, 95%, and 99.7% of the data are within 1, 2, and 3 standard deviations of the mean, respectively; thus, the observations more than two or three SD above or below the mean may be considered outliers in the data.

The example data set, X, for a simple example of this method is as follows:

3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15.

For the data set, $\bar{x}$ = 5.46, SD = 3.86, and the intervals of the 2 SD and 3 SD methods are (-2.25, 13.18) and (-6.11, 17.04), respectively. Thus, 14 and 15 are beyond the interval of the 2 SD method and there are no outliers in the 3 SD method.

3.2 Z-SCORE

Another method that can be used to screen data for outliers is the Z-Score, using the mean and standard deviation.

$$Z_i = \frac{x_i - \bar{x}}{sd}, \quad \text{where } X_i \sim N(\mu, \sigma^2) \text{ and } sd \text{ is the standard deviation of the data.}$$

The basic idea of this rule is that if X follows a normal distribution, N(μ, σ²), then Z follows a standard normal distribution, N(0, 1), and Z-scores that exceed 3 in absolute value are generally considered outliers. This method is simple, and it is the same formula as the 3 SD method when the criterion for an outlier is an absolute Z-score of at least 3. It presents a reasonable criterion for identification of outliers when data follow the normal distribution. According to Shiffler (1988), the maximum possible Z-score depends on the sample size, and it is computed as $(n-1)/\sqrt{n}$. The proof is given in Appendix B. Since no Z-score exceeds 3 in a sample of size less than or equal to 10, the Z-score method is not very good for outlier labeling, particularly in small data sets. Another limitation of this rule is that the standard deviation can be inflated by a few or even a single observation having an extreme value. Thus it can cause a masking problem, i.e., the less extreme outliers go undetected because of the most extreme outlier(s), and vice versa.
When masking occurs, the outliers may be neighbors. Table 3 shows a computation and masking problem of the Z-Score method using the previous example data set, X.

Table 3: Computation and Masking Problem of the Z-Score
(Columns: i, x_i, and Z-Score for Case 1 ($\bar{x}$ = 5.46, sd = 3.86) and for Case 2 ($\bar{x}$ = 4.73, sd = 2.82).)

For case 1, with all of the example data included, it appears that the values 14 and 15 are outliers, yet no observation exceeds the absolute value of 3. For case 2, with the most extreme value, 15, excluded from the example data, 14 is considered an outlier. This is because multiple extreme values have artificially inflated the standard deviation.

3.3 THE MODIFIED Z-SCORE

Two estimators used in the Z-Score, the sample mean and sample standard deviation, can be affected by a few extreme values or even by a single extreme value. To avoid this problem, the median and the median of the absolute deviation about the median (MAD) are employed in the modified Z-Score instead of the mean and standard deviation of the sample, respectively (Iglewicz and Hoaglin, 1993).

$$MAD = \mathrm{median}\{\,|x_i - \tilde{x}|\,\}, \quad \text{where } \tilde{x} \text{ is the sample median.}$$

The modified Z-Score ($M_i$) is computed as

$$M_i = \frac{0.6745\,(x_i - \tilde{x})}{MAD}, \quad \text{where } E(MAD) = 0.675\,\sigma \text{ for large normal data.}$$

Iglewicz and Hoaglin (1993) suggested that observations be labeled outliers when $|M_i| > 3.5$, based on a simulation with pseudo-normal observations for sample sizes of 10, 20, and 40. The $M_i$ score is effective for normal data in the same way as the Z-score.

Table 4: Computation of Modified Z-Score and its Comparison with the Z-Score
(Columns: i, x_i, Z-Score, and modified Z-Score.)

Table 4 shows the computation of the modified Z-Score and its comparison with the Z-Score for the previous example data set. While no observation is detected as an outlier by the Z-Score, the two extreme values, 14 and 15, are detected as outliers at the same time by the modified Z-Score, since this method is less susceptible to extreme values.
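The computations behind Sections 3.1 to 3.3 (and the columns of Tables 3 and 4) can be sketched as follows. The function names are illustrative assumptions; the cutoffs 3 and 3.5 and the constant 0.6745 are the ones quoted in the text.

```python
import math
import statistics

# Example data set X from Section 3.1
X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def sd_interval(data, k):
    """Interval mean +/- k * SD, using the sample standard deviation."""
    mean, sd = statistics.mean(data), statistics.stdev(data)
    return mean - k * sd, mean + k * sd


def z_scores(data):
    """Classical Z-scores based on the sample mean and standard deviation."""
    mean, sd = statistics.mean(data), statistics.stdev(data)
    return [(x - mean) / sd for x in data]


def modified_z_scores(data):
    """Modified Z-scores based on the median and the MAD."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    return [0.6745 * (x - med) / mad for x in data]


def max_z(n):
    """Largest Z-score attainable in a sample of size n (Shiffler, 1988)."""
    return (n - 1) / math.sqrt(n)


print("2 SD interval:", sd_interval(X, 2))
print("3 SD interval:", sd_interval(X, 3))

# Case 1: all data. Case 2: the most extreme value, 15, removed.
case1 = [x for x, z in zip(X, z_scores(X)) if abs(z) > 3]
case2 = [x for x, z in zip(X[:-1], z_scores(X[:-1])) if abs(z) > 3]
print("Z-score outliers, case 1:", case1)
print("Z-score outliers, case 2:", case2)

flagged = [x for x, m in zip(X, modified_z_scores(X)) if abs(m) > 3.5]
print("modified Z-score outliers:", flagged)
print("max |Z| for n = 10 and n = 11:", max_z(10), max_z(11))
```

Case 1 and case 2 together illustrate masking: 14 is only flagged by the Z-score once 15 has been removed, whereas the modified Z-score flags both values in a single pass.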

3.4 TUKEY'S METHOD (BOXPLOT)

Tukey's (1977) method, constructing a boxplot, is a well-known simple graphical tool to display information about continuous univariate data, such as the median, lower quartile, upper quartile, lower extreme, and upper extreme of a data set. It is less sensitive to extreme values of the data than the previous methods, which use the sample mean and standard deviation, because it uses quartiles, which are resistant to extreme values. The rules of the method are as follows:

1. The IQR (Inter Quartile Range) is the distance between the lower (Q1) and upper (Q3) quartiles.
2. Inner fences are located at a distance 1.5 IQR below Q1 and above Q3 [Q1-1.5 IQR, Q3+1.5 IQR].
3. Outer fences are located at a distance 3 IQR below Q1 and above Q3 [Q1-3 IQR, Q3+3 IQR].
4. A value between the inner and outer fences is a possible outlier. An extreme value beyond the outer fences is a probable outlier.

There is no statistical basis for Tukey's choice of the multipliers 1.5 and 3 of the IQR for the inner and outer fences.

For the previous example data set, Q1=3.725, Q3=4.575, and IQR=0.85. Thus, the inner fence is [2.45, 5.85] and the outer fence is [1.18, 7.13]. Two extreme values, 14 and 15, are identified as probable outliers by this method. Figure 4 is a boxplot generated using the statistical software STATA for the example data set.

Figure 4: Boxplot for the Example Data Set
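The four rules above translate directly into code. In the sketch below the quartiles are computed with the "inclusive" interpolation method of Python's `statistics.quantiles`; this choice is an assumption, made because it reproduces the values Q1 = 3.725 and Q3 = 4.575 quoted for the example, and other quartile definitions give slightly different fences. The function names are illustrative.

```python
import statistics

# Example data set X from Section 3.1
X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def tukey_fences(data):
    """Return (inner, outer) fences as (lower, upper) pairs."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    return inner, outer


def label_outliers(data):
    """Split observations into possible and probable outliers."""
    inner, outer = tukey_fences(data)
    probable = [x for x in data if x < outer[0] or x > outer[1]]
    possible = [x for x in data
                if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    return possible, probable


inner, outer = tukey_fences(X)
possible, probable = label_outliers(X)
print("inner fences:", inner)
print("outer fences:", outer)
print("possible outliers:", possible)
print("probable outliers:", probable)
```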

While the previous methods are limited to mound-shaped and reasonably symmetric data such as the normal distribution, Tukey's method is applicable to skewed or non-mound-shaped data since it makes no distributional assumptions and does not depend on a mean or standard deviation. However, Tukey's method may not be appropriate for a small sample size. For example, suppose that a data set consists of the data points 145, 147, 9, 93, 418, 158, and 9. A simple distribution of the data using a boxplot and dotplot is shown in Figure 5. Although 158 and 9 may appear to be outliers in the dotplot, no observation is shown as an outlier in the boxplot.

Figure 5: Boxplot and Dotplot. (Note: No outlier shown in the boxplot)

3.5 ADJUSTED BOXPLOT

Although the boxplot proposed by Tukey (1977) may be applicable to both symmetric and skewed data, the more skewed the data, the more observations may be detected as outliers, as shown in Figure 2. This results from the fact that the method is based on robust measures such as the lower and upper quartiles and the IQR without considering the skewness of the data. Vandervieren and Hubert (2004) introduced an adjusted boxplot that takes into account the medcouple (MC), a robust measure of skewness for a skewed distribution.

When $X_n = \{x_1, x_2, \ldots, x_n\}$ is a data set independently sampled from a continuous univariate distribution and it is sorted such that $x_1 \le x_2 \le \ldots \le x_n$, the MC of the data is defined as

$$MC(x_1, \ldots, x_n) = \mathrm{med}\; \frac{(x_j - med_k) - (med_k - x_i)}{x_j - x_i},$$

where $med_k$ is the median of $X_n$, and i and j have to satisfy $x_i \le med_k \le x_j$ and $x_i \ne x_j$. The interval of the adjusted boxplot is as follows (G. Brys et al. (2005)):

$$[L, U] = [\,Q1 - 1.5\,e^{-3.5\,MC}\,IQR,\;\; Q3 + 1.5\,e^{4\,MC}\,IQR\,] \quad \text{if } MC \ge 0$$
$$[L, U] = [\,Q1 - 1.5\,e^{-4\,MC}\,IQR,\;\; Q3 + 1.5\,e^{3.5\,MC}\,IQR\,] \quad \text{if } MC \le 0,$$

where L is the lower fence and U is the upper fence of the interval. The observations which fall outside the interval are considered outliers.

The value of the MC ranges between -1 and 1. If MC=0, the data are symmetric and the adjusted boxplot becomes Tukey's boxplot. If MC>0, the data have a right-skewed distribution, whereas if MC<0, the data have a left-skewed distribution. A simple example for the computation of the MC and a brief comparison of classical and MC skewness are in Appendix C.

For the previous example data set, Q1=3.725, Q3=4.575, IQR=0.85, and MC=0.43. Thus, the interval of the adjusted boxplot is [3.44, 11.62]. Two extreme values, 14 and 15, and the two smallest values, 3.2 and 3.4, are identified as outliers by this method. Figure 6 shows the change of the intervals of the two boxplot methods, Tukey's method and the adjusted boxplot, for the example data set. The vertical dotted lines are the lower and upper bounds of the interval of each method. Although the example data set is artificial and is not large enough to explain their difference, we can see a general trend: the interval of the adjusted boxplot, especially the upper fence, moves toward the side of the skewed tail, compared to Tukey's method.
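The medcouple and the adjusted fences can be sketched directly from the definitions above. The implementation below is a naive O(n²) version that simply skips pairs with x_i = x_j, as the definition in the text requires; it does not use the special kernel for observations tied with the median or the fast algorithm found in statistical packages, so it is meant only for small illustrative data sets. Function names and the "inclusive" quartile method are assumptions.

```python
import math
import statistics

# Example data set X from Section 3.1
X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def medcouple_naive(data):
    """Median of the kernel over all pairs x_i <= median <= x_j, x_i != x_j."""
    med = statistics.median(data)
    lower = [x for x in data if x <= med]
    upper = [x for x in data if x >= med]
    kernel = [((xj - med) - (med - xi)) / (xj - xi)
              for xi in lower for xj in upper if xi != xj]
    return statistics.median(kernel)


def adjusted_fences(data):
    """Lower and upper fence of the adjusted boxplot (constants from the text)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    mc = medcouple_naive(data)
    if mc >= 0:
        return (q1 - 1.5 * math.exp(-3.5 * mc) * iqr,
                q3 + 1.5 * math.exp(4 * mc) * iqr)
    return (q1 - 1.5 * math.exp(-4 * mc) * iqr,
            q3 + 1.5 * math.exp(3.5 * mc) * iqr)


mc = medcouple_naive(X)
lower, upper = adjusted_fences(X)
outliers = [x for x in X if x < lower or x > upper]
print(f"MC = {mc:.3f}, interval = [{lower:.2f}, {upper:.2f}]")
print("outliers:", outliers)
```

Because MC is positive here, the lower fence sits closer to Q1 and the upper fence farther from Q3 than Tukey's inner fences, which is why the two smallest values are labeled in addition to 14 and 15.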

Figure 6: Change of the Intervals of Two Different Boxplot Methods (Tukey's Method vs. the Adjusted Boxplot)
(Intervals shown: inner fences of Tukey's method (Q1-1.5*IQR, Q3+1.5*IQR); outer fences of Tukey's method (Q1-3*IQR, Q3+3*IQR); single fence of the adjusted boxplot (Q1-1.5*exp(-3.5MC)*IQR, Q3+1.5*exp(4MC)*IQR).)

Vandervieren and Hubert (2004) computed the average percentage of outliers beyond the lower and upper fences of the two types of boxplots, the adjusted boxplot and Tukey's boxplot, for several distributions and different sample sizes. In the simulation, fewer observations, especially in the right tail, are classified as outliers compared to Tukey's method when the data are skewed to the right. In the case of a mildly right-skewed distribution, the lower fence of the interval may move to the right, and more observations on the left side will be classified as outliers compared to Tukey's method. This difference mainly comes from a decrease in the distance of the lower fence from Q1 and an increase in the distance of the upper fence from Q3.

3.6 MADe METHOD

The MADe method, using the median and the Median Absolute Deviation (MAD), is one of the basic robust methods which are largely unaffected by the presence of extreme values in the data set [11]. This approach is similar to the SD method. However, the median and MADe are employed in this method instead of the mean and standard deviation. The MADe method is defined as follows:

2 MADe Method: Median ± 2 MADe
3 MADe Method: Median ± 3 MADe,

where MADe = 1.483 × MAD for large normal data.

MAD is an estimator of the spread in a data set, similar to the standard deviation [11], but it has an approximately 50% breakdown point, like the median. The notion of breakdown point is delineated in Appendix D.

$$MAD = \mathrm{median}(\,|x_i - \mathrm{median}(x)|,\; i = 1, 2, \ldots, n\,)$$

When the MAD value is scaled by a factor of 1.483, it is similar to the standard deviation in a normal distribution. This scaled MAD value is the MADe.

For the example data set, the median=4, MAD=0.3, and MADe=0.44. Thus, the intervals of the 2 MADe and 3 MADe methods are [3.11, 4.89] and [2.67, 5.33], respectively.

Since this approach uses two robust estimators having a high breakdown point, i.e., it is not unduly affected by extreme values even though a few observations make the distribution of the data skewed, the interval is seldom inflated, unlike in the SD method.

3.7 MEDIAN RULE

The median is a robust estimator of location having an approximately 50% breakdown point. It is the value that falls exactly in the center of the data when the data are arranged in order.

That is, if $x_1, x_2, \ldots, x_n$ is a random sample sorted by order of magnitude, then the median is defined as:

$$\tilde{x} = x_m \quad \text{when } n \text{ is odd}$$
$$\tilde{x} = (x_m + x_{m+1})/2 \quad \text{when } n \text{ is even, where } m = \text{round up}(n/2).$$

For a skewed distribution like income data, the median is often used in describing the average of the data. The median and mean have the same value in a symmetrical distribution.

Carling (1998) introduces the median rule for identification of outliers through studying the relationship between the target outlier percentage and Generalized Lambda Distributions (GLDs). GLDs with different parameters are used for various moderately skewed distributions. The median substitutes for the quartiles of Tukey's method, and a different scale of the IQR is employed in this method. It is more resistant, and its target outlier percentage is less affected by sample size, than Tukey's method in the non-Gaussian case. The scale of the IQR can be adjusted depending on which target outlier percentage and GLD are selected. In this paper, 2.3 is chosen as the scale of the IQR; when this scale is applied to a normal distribution, the outlier percentage turns out to be between that of Tukey's method with 1.5 IQR and that with 3 IQR, i.e., 0.2%. It is defined as:

$$[C_1, C_2] = Q2 \pm 2.3\,IQR, \quad \text{where } Q2 \text{ is the sample median.}$$

For the example data set, Q2=4 and IQR=0.85. Thus, the interval of this method is [2.05, 5.96].
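The two median-based rules of Sections 3.6 and 3.7 can be sketched in the same way as the earlier methods. The function names and the "inclusive" quartile method are assumptions chosen to match the example values in the text; the constants 1.483 and 2.3 are the ones quoted above.

```python
import statistics

# Example data set X from Section 3.1
X = [3.2, 3.4, 3.7, 3.7, 3.8, 3.9, 4, 4, 4.1, 4.2, 4.7, 4.8, 14, 15]


def made_interval(data, k):
    """Interval Median +/- k * MADe, with MADe = 1.483 * MAD."""
    med = statistics.median(data)
    mad = statistics.median(abs(x - med) for x in data)
    made = 1.483 * mad
    return med - k * made, med + k * made


def median_rule_interval(data, scale=2.3):
    """Interval Q2 +/- scale * IQR (median rule)."""
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q2 - scale * iqr, q2 + scale * iqr


def outside(data, interval):
    """Observations falling outside the given (lower, upper) interval."""
    lower, upper = interval
    return [x for x in data if x < lower or x > upper]


for name, interval in (("2 MADe", made_interval(X, 2)),
                       ("3 MADe", made_interval(X, 3)),
                       ("Median rule", median_rule_interval(X))):
    print(f"{name}: [{interval[0]:.2f}, {interval[1]:.2f}] "
          f"outliers: {outside(X, interval)}")
```

For this data set all three intervals label only 14 and 15, and none of them is widened by those two values, in contrast to the 3 SD interval of Section 3.1.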

4. SIMULATION STUDY AND RESULTS FOR THE FIVE SELECTED LABELING METHODS

Most intervals or criteria used to identify possible outliers in outlier labeling methods are effective under the normal distribution. For example, for well-known labeling methods such as the 2 SD and 3 SD methods and the boxplot (1.5 IQR), the expected percentages of observations outside the interval are 5%, 0.3%, and 0.7%, respectively, under large normal samples. Although these methods are quite powerful with large normal data, it may be problematic to apply them to non-normal data or small sample sizes without information about their characteristics in these circumstances. This is because each labeling method has different measures to detect outliers, and expected outlier percentages change differently according to the sample size or distribution type of the data.

The purpose of this simulation is to present the expected percentage of observations outside the interval of several labeling methods according to the sample size and the degree of skewness of the data, using lognormal distributions with the same mean and different variances. Through this simulation, we can learn not only the possible outlier percentage of several labeling methods but also which method is more robust with respect to the above two factors, skewness and sample size. The simulation proceeds as follows:

Five labeling methods are selected: the SD Method, the MADe Method, Tukey's Method (Boxplot), the Adjusted Boxplot, and the Median Rule. The Z-Score and modified Z-Score are not considered because their criteria to define an outlier are based on the normal distribution.

Average outlier percentages of the five labeling methods in the standard normal (0, 1) and lognormal distributions with the same mean and different variances (mean=0, variance=0.2², 0.4², 0.6², 0.8², 1.0²) are computed. For each distribution, 1000 replications of sample sizes 20 and 50, 300 replications of sample size 100, and 100 replications of sample sizes 300 and 500 are considered.
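The simulation design described above can be sketched on a reduced scale. The example below is not the thesis's program (Appendix E); it is a minimal Monte Carlo illustration that assumes four of the five methods (the adjusted boxplot is omitted to keep the code short), a single sample size of 50, 200 replications, and the "inclusive" quartile definition. All names, the seed, and these settings are assumptions.

```python
import random
import statistics


def pct_outside(data, lower, upper):
    """Percentage of observations outside [lower, upper]."""
    return 100 * sum(1 for x in data if x < lower or x > upper) / len(data)


def sd_rule(data, k=2):
    mean, sd = statistics.mean(data), statistics.stdev(data)
    return mean - k * sd, mean + k * sd


def made_rule(data, k=2):
    med = statistics.median(data)
    made = 1.483 * statistics.median(abs(x - med) for x in data)
    return med - k * made, med + k * made


def tukey_rule(data, k=1.5):
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)


def median_rule(data, scale=2.3):
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q2 - scale * (q3 - q1), q2 + scale * (q3 - q1)


METHODS = {"2 SD": sd_rule, "2 MADe": made_rule,
           "Tukey 1.5 IQR": tukey_rule, "Median rule": median_rule}


def simulate(draw, n, reps, rng):
    """Average outlier percentage per method over `reps` samples of size n."""
    totals = {name: 0.0 for name in METHODS}
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        for name, rule in METHODS.items():
            lower, upper = rule(sample)
            totals[name] += pct_outside(sample, lower, upper)
    return {name: total / reps for name, total in totals.items()}


rng = random.Random(2006)  # fixed seed so the run is repeatable
normal = simulate(lambda r: r.gauss(0, 1), n=50, reps=200, rng=rng)
lognormal = simulate(lambda r: r.lognormvariate(0, 1), n=50, reps=200, rng=rng)

for name in METHODS:
    print(f"{name:14s} normal: {normal[name]:5.2f}%  "
          f"lognormal(0, 1): {lognormal[name]:5.2f}%")
```

Extending the sketch to the full design would mean looping over the sample sizes and variances listed above and adding the adjusted boxplot as a fifth entry in `METHODS`.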
To illustrate the shape of each distribution, i.e., the degree of skewness of the data, 500 random observations were generated from the distributions, and their density plots and skewness are shown in Figure 7.

Figure 7: Standard Normal Distribution and Lognormal Distributions (cs = classical skewness, mc = medcouple skewness). Panels: Standard Normal, Lognormal(0, 0.2), Lognormal(0, 0.4), Lognormal(0, 0.6), Lognormal(0, 0.8), and Lognormal(0, 1.0); each panel plots the density value against x.

Figures 8 and 9 visually show the characteristics of the five labeling methods according to the sample size and skewness of the data using the lognormal distribution. All the values in the figures, including the standard error of the average percentage, are reported in Table 5. The results of this simulation are as follows:

1. The MADe method classifies more observations as outliers than any other method. It approaches the SD method in large normal data, because its location and scale measures, the median and MADe, become the same as the mean and standard deviation of the SD method when the data follow a normal distribution with a large sample size; however, as the data increase in skewness, the difference in outlier percentages between the MADe method and the SD method becomes larger. The MADe method, Tukey's method, and the Median rule increase in the total average percentage of outliers the more skewed the data, while the SD method and the adjusted boxplot seldom change over different sample sizes.

2. The Median rule classifies fewer observations than Tukey's 1.5 IQR method and more observations than Tukey's 3 IQR method.

3. The decrease in the total outlier percentage of the adjusted boxplot as the sample size increases is larger than that of the other methods.

4. Most methods except the adjusted boxplot show similar patterns in the average outlier percentages on the left side of the distribution. Their left outlier percentage decreases rapidly, especially for the MADe and SD methods, the more skewed the data; the adjusted boxplot, however, decreases slowly in sample sizes over 300. The different patterns of the adjusted boxplot, e.g., the increase in left outlier percentage in small sample sizes, may be due to the following: the left fence of the interval may move to the right because of the MC skewness, and a few observations may fall outside the left fence by chance. Although the number of such observations is small, their ratio in a small sample could be large. This may increase the average percentage of outliers on the left of the distribution. The adjusted boxplot may still detect observations on the left side of the distribution in right-skewed data, especially mildly skewed data; however, the average percentages are quite low.

5. The MADe method, Tukey's method, and the Median rule increase in the percentage of outliers on the right side of the distribution as the skewness of the data increases, while the SD method and the adjusted boxplot seldom change in each sample size (the SD method increases slightly and then plateaus). The right fences of the intervals of both the SD method and the adjusted boxplot move to the right as the skewness of the data increases. Since the adjusted boxplot takes the skewness of the data into account, its right fence moves further toward the skewed tail, here the right side of the distribution, as the skewness increases. On the other hand, the interval of the SD method is simply inflated by the extreme values.
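The way the adjusted boxplot fences shift with the medcouple, as described in results 4 and 5, can be illustrated with a small Python sketch. It uses the exponents that appear in the adjusted boxplot column heading of Table 5 (−3.5 for the left fence and 4 for the right fence); multiplying by the IQR is an assumption, since the heading omits it, and the quartile values below are hypothetical.

```python
import math


def adjusted_boxplot_fences(q1, q3, mc):
    """Fences Q1 - 1.5*exp(-3.5*mc)*IQR and Q3 + 1.5*exp(4*mc)*IQR.

    The IQR factor is assumed; mc is the medcouple skewness, supplied by the caller.
    """
    iqr = q3 - q1
    left = q1 - 1.5 * math.exp(-3.5 * mc) * iqr
    right = q3 + 1.5 * math.exp(4.0 * mc) * iqr
    return left, right


# Hypothetical quartiles; only the medcouple is varied.
for mc in (0.0, 0.2, 0.4):
    left, right = adjusted_boxplot_fences(1.0, 2.0, mc)
    print(f"mc={mc:.1f}: left fence={left:.3f}, right fence={right:.3f}")
```

With mc = 0 the fences reduce to Tukey's 1.5 IQR fences. As mc grows, both fences move to the right: the left fence closes in on Q1, which is consistent with a few left-side observations being flagged in right-skewed samples, while the right fence moves far into the long tail.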

Figure 8: Change in the Outlier Percentages According to the Skewness of the Data (one panel per sample size: 20, 50, 100, 300, and 500)

Figure 9: Change in the Total Percentages of Outliers According to the Sample Size

34 Table 5: The Average Percetage of Left Outliers, Right Outliers ad the Average Total Percet of Outliers for the Logormal Distributios with the Same Mea ad Differet Variaces (mea=, variace=.,.4,.6,.8, 1. ) ad the Stadard Normal Distributio with Differet Sample Sizes. Distributio SN LN (,.) LN (,.4) LN (,.6) CS.6 (.15) -.17 (.1) -.6 (.13).6 (.15) -.8 (.1).436 (.16).57 (.1).574 (.18).64 (.).69 (.15).864 (.) 1.6 (.17) (.7) 1.51 (.33) 1.33 (.5) 1.1 (.4) 1.63 (.4) (.39).1 (.63).199 (.64) MC -.4 (.7) -.9 (.5).1 (.6).4 (.6).4 (.5).84 (.7).86 (.5).79 (.6).93 (.6).94 (.4).161 (.7).17 (.5).181 (.7).167 (.6).17 (.5).19 (.7).5 (.5).51 (.6).54 (.7).55 (.5) Left (.8).176 (.53).6 (.66).67 (.73).66 (.51).555 (.5).71 (.37).73 (.5).676 (.44).594 (.35).95 (.).4 (.9). (.8).7 (.5). (.) () () () () () SD Method MADe Method Mea ± SD Mea ± 3 SD Media ± MADe Media ± 3 MADe Right Total Left Right Total Left Right Total Left Right Total (.83) (.11) (.1) (.11) (.16) (.15) (.156) (.41) (.66) (.73) (.19) (.5) (.63) (.13) (.1) (.17) (.95) (.88) (.141) (.3) (.7) (.45) (.6) (.79) (.17) (.) (.6) (.115) (.19) (.184) (.3) (.36) (.55) (.6) (.86) (.19) (.1) (.6) (.11) (.99) (.173) (.9) (.8) (.4) (.47) (.59) (.16) (.17) (.5) (.78) (.8) (.133) (.19) (.19) (.9) (.9) (.95) () (.31) (.31) (.11) (.183) (.7) (.34) (.18) (.119) (.55) (.59) () (.8) (.8) (.6) (.114) (.141) (.8) (.57) (.59) (.73) (.76) (.3) (.38) (.38) (.77) (.139) (.168) (.9) (.65) (.67) (.71) (.81) () (.35) (.35) (.6) (.16) (.185) () (.68) (.68) (.64) (.65) () (.9) (.9) (.4) (.116) (.13) () (.51) (.51) (.9) (.9) () (.55) (.55) (.91) (.197) (.5) (.3) (.141) (.144) (.55) (.54) () (.37) (.37) (.5) (.17) (.133) (.) 
(.84) (.85) (.73) (.73) () (.44) (.44) (.3) (.168) (.173) () (.11) (.11) (.65) (.66) () (.46) (.46) (.14) (.158) (.163) () (.94) (.94) (.56) (.57) () (.3) (.3) (.5) (.149) (.151) () (.74) (.74) (.84) (.84) () (.69) (.69) (.4) (.16) (.4) (.5) (.164) (.165) (.56) (.56) () (.38) (.38) (.11) (.14) (.14) () (.15) (.15) (.74) (.74) () (.5) (.5) (.1) (.17) (.171) () (.133) (.133) (.86) (.86) () (.51) (.51) () (.178) (.178) () (.146) (.146) (.68) (.68) () (.47) (.47) () (.145) (.145) () (.16) (.16) 7

35 8 Table 5 (cotiued) SD Method MADe Method Mea ± SD Mea ± 3 SD Media ± MADe Media ± 3 MADe Distributio CS MC Left Right Total Left Right Total Left Right Total Left Right Total 1.56 (.4).31 (.7).5 (.5) 5.71 (.81) (.81) () 1.86 (.76) 1.86 (.76).95 (.31) 13.4 (.18) (.4).5 (.5) (.191) 7.9 (.191) 5.16 (.3).315 (.5) () 5.5 (.58) 5.5 (.58) ().13 (.38).13 (.38).6 (.4) 1.93 (.143) (.144) () 7.33 (.1) 7.33 (.1) (.58).314 (.7) () (.84) (.84) ().177 (.49).177 (.49) () (.183) (.183) () 7.83 (.151) 7.83 (.151) (.15).37 (.7) () 4.5 (.13) 4.5 (.13) () (.53) (.53) () 1.75 (.193) 1.75 (.193) () 7.17 (.16) 7.17 (.16) LN (,.8) 5.98 (.96).34 (.5) () 4.48 (.76) 4.48 (.76) () 1.94 (.41) 1.94 (.41) () 1.84 (.13) 1.84 (.13) () 7.9 (.1) 7.9 (.1) (.6).353 (.7) () 6.3 (.78) 6.3 (.78) ().455 (.79).455 (.79).5 (.5) (.17) (.18). (.16) (.195) (.195) 5.66 (.34).384 (.5) () 5.48 (.6) 5.48 (.6) ().486 (.37).486 (.37) () (.15) (.15) () 1.14 (.133) 1.14 (.133) (.79).4 (.6) () 4.76 (.96) 4.76 (.96) ().73 (.51).73 (.51) () (.19) (.19) () (.165) (.165) (.183).399 (.7) () 4.7 (.18) 4.7 (.18) ().13 (.6).13 (.6) () 15.6 (.11) 15.6 (.11) () 9.67 (.187) 9.67 (.187) LN (, 1.) (.155).394 (.5) () 4.15 (.85) 4.15 (.85) () 1.99 (.46) 1.99 (.46) () (.137) (.137) () 9.73 (.16) 9.73 (.16) Tukey s Method Adjusted Boxplot Media Rule Q1-1.5 IQR / Q3+1.5 IQR Q1-3 IQR / Q3+3 IQR Q1-1.5exp(-3.5mc)/ Q3+1.5exp(4mc) Q ±.3 IQR Distributio CS MC Left Right Total Left Right Total Left Right Total Left Right Total.6 (.15) -.4 (.7) 1.1 (.83) 1.17 (.89).7 (.137).65 (.19).4 (.14).15 (.6).39 (.135).75 (.153) 5.14 (.178).615 (.61).685 (.67) 1.3 (.1) (.1) -.9 (.5).74 (.43).66 (.41) (.65).6 (.3). (.).8 (.4) 1.46 (.9).7 (.11) 3.53 (.11).9 (.8).36 (.4).58 (.39) (.13).1 (.6).537 (.49).51 (.49) 1.47 (.78).3 (.3).3 (.3).7 (.5) 1.5 (.11) 1.18 (.17).3 (.15).18 (.6).183 (.8).363 (.4) 3.6 (.15).4 (.6).4 (.45).363 (.38).783 (.61) () () ().59 (.77).647 (.93) 1.37 (.98).17 (.5).13 (.4).57 (.35) SN (.1).4 (.5).34 (.9).354 (.3).696 (.47) (). (.). 
(.).564 (.71).468 (.57).89 (1.17).11 (.16).11 (.16). (.3)

36 9 Table 5 (cotiued) Tukey s Method Adjusted Boxplot Media Rule Q1-1.5 IQR / Q3+1.5 IQR Q1-3 IQR / Q3+3 IQR Q1-1.5exp(-3.5mc)/ Q3+1.5exp(4mc) Q ±.3 IQR Distributio CS MC Left Right Total Left Right Total Left Right Total Left Right Total.436 (.16).84 (.7).415 (.5).9 (.113).75 (.137) ().1 (.33).1 (.33).75 (.146).395 (.143) 5.1 (.177).19 (.35) (.98) (.111) 5.57 (.1).86 (.5).146 (.) 1.86 (.67) 1.95 (.75) ().18 (.15).18 (.15) (.13) (.91) 3.41 (.118).8 (.8) (.5) 1.14 (.54) (.18).79 (.6).63 (.1) 1.6 (.76) (.78) ().63 (.15).63 (.15).95 (.11) 1.1 (.13).5 (.15).3 (.3).913 (.55).917 (.55) 3.64 (.).93 (.6). (.9) (.86) 1.67 (.86) ().77 (.16).77 (.16).8 (.13).543 (.68) (.97) ().94 (.64).94 (.64) LN (..) 5.69 (.15).94 (.4).1 (.6) 1.51 (.6) 1.54 (.6) ().36 (.9).36 (.9).47 (.7).356 (.39).88 (.66) ().838 (.44).838 (.44).864 (.).161 (.7).145 (.33) 3.85 (.131) 3.95 (.139) ().755 (.63).755 (.63).785 (.153).16 (.13) (.175).5 (.11).87 (.1).895 (.11) (.17).17 (.5).1 (.4) (.88) 3.56 (.88) ().56 (.34).56 (.34).38 (.11) 1.54 (.87) 3.54 (.119) ().538 (.76).538 (.76) (.7).181 (.7) () 3.3 (.15) 3.3 (.15) ().373 (.38).373 (.38) (.143).717 (.71).153 (.143) ().143 (.9).143 (.9) (.33).167 (.6) ().93 (.95).93 (.95) ().363 (.4).363 (.4).587 (.98).553 (.61) 1.14 (.98) ().47 (.83).47 (.83) LN (,.4) (.5).17 (.5) () 3.78 (.77) 3.78 (.77) ().41 (.3).41 (.3).4 (.59).514 (.48).916 (.65) ().54 (.67).54 (.67) 1.1 (.4).19 (.7).1 (.7) 5.5 (.151) 5.15 (.15) () 1.48 (.86) 1.48 (.86) 3.75 (.169) 1.94 (.117) 5.15 (.181) () (.139) (.139) (.4).5 (.5) () 4.8 (.95) 4.8 (.95) () 1.7 (.5) 1.7 (.5).178 (.13) (.69) 3.33 (.15) () 4.98 (.87) 4.98 (.87) (.39).51 (.6) () (.18) (.18) () 1.15 (.69) 1.15 (.69) (.134).767 (.77).163 (.136) () (.119) (.119) 3.1 (.63).54 (.7) () 4.81 (.13) 4.81 (.13) () (.61) (.61).633 (.18).593 (.66) 1.7 (.13) () 3.97 (.119) 3.97 (.119) LN (,.6) (.64).55 (.5) () 4.59 (.93) 4.59 (.93) () 1.7 (.48) 1.7 (.48).5 (.74).496 (.58) 1.16 (.78) () 3.7 (.8) 3.7 (.8) 1.56 (.4).31 (.7).1 (.1) (.16) 6.85 
(.163) ().595 (.113).595 (.113) 3.46 (.177) (.1) (.191) () 5.91 (.157) 5.91 (.157) 5.16 (.3).315 (.5) () (.1) (.1) ().8 (.68).8 (.68).134 (.1) 1.18 (.74) 3.35 (.17) () (.98) (.98) (.58).314 (.7) () (.131) (.131) ()..7 (.84)..7 (.84) 1.8 (.147).99 (.93).7 (.158) () 5.65 (.18) 5.65 (.18) (.15).37 (.7) () 6.67 (.137) 6.67 (.137) ().153 (.75).153 (.75).59 (.15).64 (.6) 1.3 (.119) () (.134) (.134) LN (,.8) 5.98 (.96).34 (.5) () (.113) (.113) () (.68) (.68).4 (.56).53 (.5).774 (.63) () (.13) (.13)

37 Table 5 (cotiued) Distributio CS MC LN (, 1.) (.6).66 (.34) 3.86 (.79) (.183) 4.5 (.155).353 (.7).384 (.5).4 (.6).399 (.7).394 (.5) Q1-1.5 IQR / Q3+1.5 IQR Left () () () () () Right 8.37 (.166) 8.16 (.11) (.144) 7.73 (.158) 7.68 (.1) Tukey s Method Adjusted Boxplot Media Rule Total 8.37 (.166) 8.16 (.11) (.144) 7.73 (.158) 7.68 (.1) Left () () () () () Q1-3 IQR / Q3+3 IQR Right 4.5 (.133) (.83) (.1) 3.3 (.11) (.75) Total 4.5 (.133) (.83) (.1) 3.3 (.11) (.75) Left (.179) (.11) 1. (.135).43 (.114).134 (.4) Q1-1.5exp(-3.5mc)/ Q3+1.5exp(4mc) Right.385 (.134) 1.41 (.76).847 (.74).687 (.6).616 (.5) Total 5.57 (.197) 3.38 (.17).47 (.138) 1.11 (.116).75 (.58) Left () () () () () Q ±.3 IQR Right (.163) (.17) 7.63 (.143) 7.4 (.148) (.116) Total (.163) (.17) 7.63 (.143) 7.4 (.148) (.116) (stadard error of the average percetage of outliers) 3

5. APPLICATION

In this chapter the five selected outlier labeling methods are applied to three real data sets and to one modified version of one of them. The real data sets are provided by Gateway Health Plan, a managed care alternative to the Department of Public Welfare's Medical Assistance Program in Pennsylvania. They are part of the Primary Care Providers' (PCPs') basic information, which is needed to identify providers associated with Member Dissatisfaction Rates (MDRs = the number of member complaints / PCP practice size) that are unusually high compared with other PCPs of similar-sized practices [3].

Case 1 (data set 1) is "visit per 1000 office med", and its distribution is not very different from the normal distribution. Case 2 (data set 2) is "Scripts per 1000 Rx", and its distribution is mildly skewed to the right. Case 3 (data set 3) is "Svcs per 1000 early child im", and its distribution is highly skewed to the right because of one observation with an extremely large value. Case 4 (data set 4) is data set 3 with the most extreme value excluded, in order to see the possible effect of the one extreme outlier on the outlier labeling methods. Figure 10 shows the basic statistics and distribution of each data set (Case 1-Case 4).

Figure 10: Histogram and Basic Statistics of Case 1-Case 4. Each panel shows a density histogram of one case together with its minimum, quartiles, mean, median, maximum, total N, variance, standard deviation, SE of the mean, confidence limits of the mean, skewness, kurtosis, and medcouple skewness; the total N is 209 for Cases 1 and 2, 127 for Case 3, and 126 for Case 4.

Table 6 shows the left, right, and total number of outliers identified in each data set after applying the five outlier labeling methods. Sample programs for Case 4 are given in APPENDIX E.

Table 6: Interval, Left, Right, and Total Number of Outliers According to the Five Outlier Methods (percentage of N in parentheses)

Case 1 (Data set 1): N=209
Method                      Interval          Left       Right       Total
2 SD Method                 (131.49, )        2 (0.96)   6 (2.87)    8 (3.83)
3 SD Method                 ( , )             0 (0)      1 (0.48)    1 (0.48)
Tukey's Method (1.5 IQR)    (376.81, )        1 (0.48)   2 (0.96)    3 (1.44)
Tukey's Method (3 IQR)      (-49.6, 17.3)     0 (0)      0 (0)       0 (0)
Adjusted Boxplot            (95.41, )         1 (0.48)   1 (0.48)    2 (0.96)
2 MADe Method               (131.5, 656.1)    4 (1.91)   11 (5.26)   15 (7.18)
3 MADe Method               (74.1, 749.5)     1 (0.48)   2 (0.96)    3 (1.44)
Median Rule                 (-43.87, )        0 (0)      1 (0.48)    1 (0.48)

Case 2 (Data set 2): N=209
Method                      Interval          Left       Right       Total
2 SD Method                 ( , )             0 (0)      8 (3.83)    8 (3.83)
3 SD Method                 ( , )             0 (0)      4 (1.91)    4 (1.91)
Tukey's Method (1.5 IQR)    (169.66, )        0 (0)      8 (3.83)    8 (3.83)
Tukey's Method (3 IQR)      ( , )             0 (0)      3 (1.44)    3 (1.44)
Adjusted Boxplot            (858.85, )        5 (2.39)   2 (0.96)    7 (3.35)
2 MADe Method               ( , )             4 (1.91)   20 (9.57)   24 (11.48)
3 MADe Method               ( , 488.4)        0 (0)      6 (2.87)    6 (2.87)
Median Rule                 ( , 4939.)        0 (0)      5 (2.39)    5 (2.39)

Case 3 (Data set 3): N=127
Method                      Interval          Left       Right       Total
2 SD Method                 ( , )             0 (0)      1 (0.79)    1 (0.79)
3 SD Method                 ( , 53.38)        0 (0)      1 (0.79)    1 (0.79)
Tukey's Method (1.5 IQR)    (-96.38, )        0 (0)      3 (2.36)    3 (2.36)
Tukey's Method (3 IQR)      ( , )             0 (0)      2 (1.57)    2 (1.57)
Adjusted Boxplot            ( , )             1 (0.79)   2 (1.57)    3 (2.36)
2 MADe Method               (114.7, )         1 (0.79)   6 (4.72)    7 (5.51)
3 MADe Method               ( , )             0 (0)      3 (2.36)    3 (2.36)
Median Rule                 ( , )             0 (0)      3 (2.36)    3 (2.36)
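The Left, Right, and Total columns of Table 6 follow directly from a method's interval. The Python sketch below shows one way to tabulate them; the function name and the toy data are hypothetical and are not the Gateway Health Plan data or the sample program from APPENDIX E.

```python
def count_outliers(xs, low, high):
    """Count observations below `low` and above `high`, with percentages of N."""
    n = len(xs)
    left = sum(1 for x in xs if x < low)
    right = sum(1 for x in xs if x > high)

    def pct(k):
        return round(100.0 * k / n, 2)

    return {
        "left": (left, pct(left)),
        "right": (right, pct(right)),
        "total": (left + right, pct(left + right)),
    }


# Toy data and interval, for illustration only.
data = [12.0, 15.5, 14.2, 13.8, 90.0, 14.9, 13.1, 2.0, 15.0, 14.4]
print(count_outliers(data, low=10.0, high=20.0))
```

Each parenthesised percentage in Table 6 is the count divided by N; for example, 8 flagged observations out of 209 correspond to 3.83%.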


More information

hp calculators HP 12C Statistics - average and standard deviation Average and standard deviation concepts HP12C average and standard deviation

hp calculators HP 12C Statistics - average and standard deviation Average and standard deviation concepts HP12C average and standard deviation HP 1C Statistics - average ad stadard deviatio Average ad stadard deviatio cocepts HP1C average ad stadard deviatio Practice calculatig averages ad stadard deviatios with oe or two variables HP 1C Statistics

More information

LECTURE 13: Cross-validation

LECTURE 13: Cross-validation LECTURE 3: Cross-validatio Resampli methods Cross Validatio Bootstrap Bias ad variace estimatio with the Bootstrap Three-way data partitioi Itroductio to Patter Aalysis Ricardo Gutierrez-Osua Texas A&M

More information

Incremental calculation of weighted mean and variance

Incremental calculation of weighted mean and variance Icremetal calculatio of weighted mea ad variace Toy Fich faf@cam.ac.uk dot@dotat.at Uiversity of Cambridge Computig Service February 009 Abstract I these otes I eplai how to derive formulae for umerically

More information

, a Wishart distribution with n -1 degrees of freedom and scale matrix.

, a Wishart distribution with n -1 degrees of freedom and scale matrix. UMEÅ UNIVERSITET Matematisk-statistiska istitutioe Multivariat dataaalys D MSTD79 PA TENTAMEN 004-0-9 LÖSNINGSFÖRSLAG TILL TENTAMEN I MATEMATISK STATISTIK Multivariat dataaalys D, 5 poäg.. Assume that

More information

Institute of Actuaries of India Subject CT1 Financial Mathematics

Institute of Actuaries of India Subject CT1 Financial Mathematics Istitute of Actuaries of Idia Subject CT1 Fiacial Mathematics For 2014 Examiatios Subject CT1 Fiacial Mathematics Core Techical Aim The aim of the Fiacial Mathematics subject is to provide a groudig i

More information

INVESTMENT PERFORMANCE COUNCIL (IPC)

INVESTMENT PERFORMANCE COUNCIL (IPC) INVESTMENT PEFOMANCE COUNCIL (IPC) INVITATION TO COMMENT: Global Ivestmet Performace Stadards (GIPS ) Guidace Statemet o Calculatio Methodology The Associatio for Ivestmet Maagemet ad esearch (AIM) seeks

More information

Predictive Modeling Data. in the ACT Electronic Student Record

Predictive Modeling Data. in the ACT Electronic Student Record Predictive Modelig Data i the ACT Electroic Studet Record overview Predictive Modelig Data Added to the ACT Electroic Studet Record With the release of studet records i September 2012, predictive modelig

More information

Soving Recurrence Relations

Soving Recurrence Relations Sovig Recurrece Relatios Part 1. Homogeeous liear 2d degree relatios with costat coefficiets. Cosider the recurrece relatio ( ) T () + at ( 1) + bt ( 2) = 0 This is called a homogeeous liear 2d degree

More information

Sampling Distribution And Central Limit Theorem

Sampling Distribution And Central Limit Theorem () Samplig Distributio & Cetral Limit Samplig Distributio Ad Cetral Limit Samplig distributio of the sample mea If we sample a umber of samples (say k samples where k is very large umber) each of size,

More information

Present Values, Investment Returns and Discount Rates

Present Values, Investment Returns and Discount Rates Preset Values, Ivestmet Returs ad Discout Rates Dimitry Midli, ASA, MAAA, PhD Presidet CDI Advisors LLC dmidli@cdiadvisors.com May 2, 203 Copyright 20, CDI Advisors LLC The cocept of preset value lies

More information

Basic Data Analysis Principles. Acknowledgments

Basic Data Analysis Principles. Acknowledgments CEB - Basic Data Aalysis Priciples Basic Data Aalysis Priciples What to do oce you get the data Whe we reaso about quatitative evidece, certai methods for displayig ad aalyzig data are better tha others.

More information

Convexity, Inequalities, and Norms

Convexity, Inequalities, and Norms Covexity, Iequalities, ad Norms Covex Fuctios You are probably familiar with the otio of cocavity of fuctios. Give a twicedifferetiable fuctio ϕ: R R, We say that ϕ is covex (or cocave up) if ϕ (x) 0 for

More information

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth

.04. This means $1000 is multiplied by 1.02 five times, once for each of the remaining sixmonth Questio 1: What is a ordiary auity? Let s look at a ordiary auity that is certai ad simple. By this, we mea a auity over a fixed term whose paymet period matches the iterest coversio period. Additioally,

More information

Chapter XIV: Fundamentals of Probability and Statistics *

Chapter XIV: Fundamentals of Probability and Statistics * Objectives Chapter XIV: Fudametals o Probability ad Statistics * Preset udametal cocepts o probability ad statistics Review measures o cetral tedecy ad dispersio Aalyze methods ad applicatios o descriptive

More information

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find

Approximating Area under a curve with rectangles. To find the area under a curve we approximate the area using rectangles and then use limits to find 1.8 Approximatig Area uder a curve with rectagles 1.6 To fid the area uder a curve we approximate the area usig rectagles ad the use limits to fid 1.4 the area. Example 1 Suppose we wat to estimate 1.

More information

MEI Structured Mathematics. Module Summary Sheets. Statistics 2 (Version B: reference to new book)

MEI Structured Mathematics. Module Summary Sheets. Statistics 2 (Version B: reference to new book) MEI Mathematics i Educatio ad Idustry MEI Structured Mathematics Module Summary Sheets Statistics (Versio B: referece to ew book) Topic : The Poisso Distributio Topic : The Normal Distributio Topic 3:

More information

Now here is the important step

Now here is the important step LINEST i Excel The Excel spreadsheet fuctio "liest" is a complete liear least squares curve fittig routie that produces ucertaity estimates for the fit values. There are two ways to access the "liest"

More information

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas:

Chapter 7 - Sampling Distributions. 1 Introduction. What is statistics? It consist of three major areas: Chapter 7 - Samplig Distributios 1 Itroductio What is statistics? It cosist of three major areas: Data Collectio: samplig plas ad experimetal desigs Descriptive Statistics: umerical ad graphical summaries

More information

PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM

PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM PROCEEDINGS OF THE YEREVAN STATE UNIVERSITY Physical ad Mathematical Scieces 2015, 1, p. 15 19 M a t h e m a t i c s AN ALTERNATIVE MODEL FOR BONUS-MALUS SYSTEM A. G. GULYAN Chair of Actuarial Mathematics

More information

Modified Line Search Method for Global Optimization

Modified Line Search Method for Global Optimization Modified Lie Search Method for Global Optimizatio Cria Grosa ad Ajith Abraham Ceter of Excellece for Quatifiable Quality of Service Norwegia Uiversity of Sciece ad Techology Trodheim, Norway {cria, ajith}@q2s.tu.o

More information

where: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return

where: T = number of years of cash flow in investment's life n = the year in which the cash flow X n i = IRR = the internal rate of return EVALUATING ALTERNATIVE CAPITAL INVESTMENT PROGRAMS By Ke D. Duft, Extesio Ecoomist I the March 98 issue of this publicatio we reviewed the procedure by which a capital ivestmet project was assessed. The

More information

Hypergeometric Distributions

Hypergeometric Distributions 7.4 Hypergeometric Distributios Whe choosig the startig lie-up for a game, a coach obviously has to choose a differet player for each positio. Similarly, whe a uio elects delegates for a covetio or you

More information

CHAPTER 3 THE TIME VALUE OF MONEY

CHAPTER 3 THE TIME VALUE OF MONEY CHAPTER 3 THE TIME VALUE OF MONEY OVERVIEW A dollar i the had today is worth more tha a dollar to be received i the future because, if you had it ow, you could ivest that dollar ad ear iterest. Of all

More information

Trigonometric Form of a Complex Number. The Complex Plane. axis. ( 2, 1) or 2 i FIGURE 6.44. The absolute value of the complex number z a bi is

Trigonometric Form of a Complex Number. The Complex Plane. axis. ( 2, 1) or 2 i FIGURE 6.44. The absolute value of the complex number z a bi is 0_0605.qxd /5/05 0:45 AM Page 470 470 Chapter 6 Additioal Topics i Trigoometry 6.5 Trigoometric Form of a Complex Number What you should lear Plot complex umbers i the complex plae ad fid absolute values

More information

Lecture 13. Lecturer: Jonathan Kelner Scribe: Jonathan Pines (2009)

Lecture 13. Lecturer: Jonathan Kelner Scribe: Jonathan Pines (2009) 18.409 A Algorithmist s Toolkit October 27, 2009 Lecture 13 Lecturer: Joatha Keler Scribe: Joatha Pies (2009) 1 Outlie Last time, we proved the Bru-Mikowski iequality for boxes. Today we ll go over the

More information

3. Greatest Common Divisor - Least Common Multiple

3. Greatest Common Divisor - Least Common Multiple 3 Greatest Commo Divisor - Least Commo Multiple Defiitio 31: The greatest commo divisor of two atural umbers a ad b is the largest atural umber c which divides both a ad b We deote the greatest commo gcd

More information

Section 11.3: The Integral Test

Section 11.3: The Integral Test Sectio.3: The Itegral Test Most of the series we have looked at have either diverged or have coverged ad we have bee able to fid what they coverge to. I geeral however, the problem is much more difficult

More information

CHAPTER 3 DIGITAL CODING OF SIGNALS

CHAPTER 3 DIGITAL CODING OF SIGNALS CHAPTER 3 DIGITAL CODING OF SIGNALS Computers are ofte used to automate the recordig of measuremets. The trasducers ad sigal coditioig circuits produce a voltage sigal that is proportioal to a quatity

More information

Practice Problems for Test 3

Practice Problems for Test 3 Practice Problems for Test 3 Note: these problems oly cover CIs ad hypothesis testig You are also resposible for kowig the samplig distributio of the sample meas, ad the Cetral Limit Theorem Review all

More information

Page 1. Real Options for Engineering Systems. What are we up to? Today s agenda. J1: Real Options for Engineering Systems. Richard de Neufville

Page 1. Real Options for Engineering Systems. What are we up to? Today s agenda. J1: Real Options for Engineering Systems. Richard de Neufville Real Optios for Egieerig Systems J: Real Optios for Egieerig Systems By (MIT) Stefa Scholtes (CU) Course website: http://msl.mit.edu/cmi/ardet_2002 Stefa Scholtes Judge Istitute of Maagemet, CU Slide What

More information

Multi-server Optimal Bandwidth Monitoring for QoS based Multimedia Delivery Anup Basu, Irene Cheng and Yinzhe Yu

Multi-server Optimal Bandwidth Monitoring for QoS based Multimedia Delivery Anup Basu, Irene Cheng and Yinzhe Yu Multi-server Optimal Badwidth Moitorig for QoS based Multimedia Delivery Aup Basu, Iree Cheg ad Yizhe Yu Departmet of Computig Sciece U. of Alberta Architecture Applicatio Layer Request receptio -coectio

More information

Theorems About Power Series

Theorems About Power Series Physics 6A Witer 20 Theorems About Power Series Cosider a power series, f(x) = a x, () where the a are real coefficiets ad x is a real variable. There exists a real o-egative umber R, called the radius

More information

Project Deliverables. CS 361, Lecture 28. Outline. Project Deliverables. Administrative. Project Comments

Project Deliverables. CS 361, Lecture 28. Outline. Project Deliverables. Administrative. Project Comments Project Deliverables CS 361, Lecture 28 Jared Saia Uiversity of New Mexico Each Group should tur i oe group project cosistig of: About 6-12 pages of text (ca be loger with appedix) 6-12 figures (please

More information

OMG! Excessive Texting Tied to Risky Teen Behaviors

OMG! Excessive Texting Tied to Risky Teen Behaviors BUSIESS WEEK: EXECUTIVE EALT ovember 09, 2010 OMG! Excessive Textig Tied to Risky Tee Behaviors Kids who sed more tha 120 a day more likely to try drugs, alcohol ad sex, researchers fid TUESDAY, ov. 9

More information

How to read A Mutual Fund shareholder report

How to read A Mutual Fund shareholder report Ivestor BulletI How to read A Mutual Fud shareholder report The SEC s Office of Ivestor Educatio ad Advocacy is issuig this Ivestor Bulleti to educate idividual ivestors about mutual fud shareholder reports.

More information

SAMPLE QUESTIONS FOR FINAL EXAM. (1) (2) (3) (4) Find the following using the definition of the Riemann integral: (2x + 1)dx

SAMPLE QUESTIONS FOR FINAL EXAM. (1) (2) (3) (4) Find the following using the definition of the Riemann integral: (2x + 1)dx SAMPLE QUESTIONS FOR FINAL EXAM REAL ANALYSIS I FALL 006 3 4 Fid the followig usig the defiitio of the Riema itegral: a 0 x + dx 3 Cosider the partitio P x 0 3, x 3 +, x 3 +,......, x 3 3 + 3 of the iterval

More information

PENSION ANNUITY. Policy Conditions Document reference: PPAS1(7) This is an important document. Please keep it in a safe place.

PENSION ANNUITY. Policy Conditions Document reference: PPAS1(7) This is an important document. Please keep it in a safe place. PENSION ANNUITY Policy Coditios Documet referece: PPAS1(7) This is a importat documet. Please keep it i a safe place. Pesio Auity Policy Coditios Welcome to LV=, ad thak you for choosig our Pesio Auity.

More information

Sequences and Series

Sequences and Series CHAPTER 9 Sequeces ad Series 9.. Covergece: Defiitio ad Examples Sequeces The purpose of this chapter is to itroduce a particular way of geeratig algorithms for fidig the values of fuctios defied by their

More information

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction

THE ARITHMETIC OF INTEGERS. - multiplication, exponentiation, division, addition, and subtraction THE ARITHMETIC OF INTEGERS - multiplicatio, expoetiatio, divisio, additio, ad subtractio What to do ad what ot to do. THE INTEGERS Recall that a iteger is oe of the whole umbers, which may be either positive,

More information

Irreducible polynomials with consecutive zero coefficients

Irreducible polynomials with consecutive zero coefficients Irreducible polyomials with cosecutive zero coefficiets Theodoulos Garefalakis Departmet of Mathematics, Uiversity of Crete, 71409 Heraklio, Greece Abstract Let q be a prime power. We cosider the problem

More information

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling

Taking DCOP to the Real World: Efficient Complete Solutions for Distributed Multi-Event Scheduling Taig DCOP to the Real World: Efficiet Complete Solutios for Distributed Multi-Evet Schedulig Rajiv T. Maheswara, Milid Tambe, Emma Bowrig, Joatha P. Pearce, ad Pradeep araatham Uiversity of Souther Califoria

More information

Basic Elements of Arithmetic Sequences and Series

Basic Elements of Arithmetic Sequences and Series MA40S PRE-CALCULUS UNIT G GEOMETRIC SEQUENCES CLASS NOTES (COMPLETED NO NEED TO COPY NOTES FROM OVERHEAD) Basic Elemets of Arithmetic Sequeces ad Series Objective: To establish basic elemets of arithmetic

More information

ODBC. Getting Started With Sage Timberline Office ODBC

ODBC. Getting Started With Sage Timberline Office ODBC ODBC Gettig Started With Sage Timberlie Office ODBC NOTICE This documet ad the Sage Timberlie Office software may be used oly i accordace with the accompayig Sage Timberlie Office Ed User Licese Agreemet.

More information

4.3. The Integral and Comparison Tests

4.3. The Integral and Comparison Tests 4.3. THE INTEGRAL AND COMPARISON TESTS 9 4.3. The Itegral ad Compariso Tests 4.3.. The Itegral Test. Suppose f is a cotiuous, positive, decreasig fuctio o [, ), ad let a = f(). The the covergece or divergece

More information

PUBLIC RELATIONS PROJECT 2016

PUBLIC RELATIONS PROJECT 2016 PUBLIC RELATIONS PROJECT 2016 The purpose of the Public Relatios Project is to provide a opportuity for the chapter members to demostrate the kowledge ad skills eeded i plaig, orgaizig, implemetig ad evaluatig

More information

Inverse Gaussian Distribution

Inverse Gaussian Distribution 5 Kauhisa Matsuda All rights reserved. Iverse Gaussia Distributio Abstract Kauhisa Matsuda Departmet of Ecoomics The Graduate Ceter The City Uiversity of New York 65 Fifth Aveue New York NY 6-49 Email:

More information