1 Exploratory Data Analysis

Exploratory data analysis is often the first step in a statistical analysis, for it helps in understanding the main features of the particular sample that an analyst is using. Intelligent descriptions or summaries of the data may sometimes be sufficient to fulfill the purposes for which the data were gathered. Effective summaries can also point to "bad" data or unexpected aspects that might go unnoticed if data are blindly crunched by computers. Further, exploratory data analysis suggests possible probability models for the data and helps in understanding the population features that a good model ought to be able to reproduce.

Here we shall briefly discuss ways of summarizing three features of the distribution of a batch of data $Z = (Z_1, \ldots, Z_n)$: its center, its spread and its shape. All three concepts are deliberately kept vague. For further references see Hoaglin et al. (1983) and Mosteller and Tukey (1977). The term batch is used to emphasize the fact that at this stage no commitment to a statistical model is being made.

1.1 MEASURES OF CENTER

A very popular measure of center (or location) is the (arithmetic) mean
$$\bar Z = n^{-1} \sum_{i=1}^{n} Z_i = n^{-1} \iota_n^\top Z,$$
where $\iota_n$ denotes an $n$-vector of ones. Notice that the mean need not coincide with any of the observations in the batch.

The use of the mean is partly justified by its linearity property, that is, if $X$ and $Y$ are batches of data of equal size and $Z$ is such that $Z_i = \mu + X_i + Y_i$, then
$$\bar Z = \mu + \bar X + \bar Y.$$
Notice however that if $Z_i = g(X_i, Y_i)$, where $g$ is an arbitrary function, then it is generally not true that $\bar Z = g(\bar X, \bar Y)$.

The mean is very sensitive to adding and dropping observations. In particular, it is very sensitive to even a single outlier, that is, an arbitrarily large or small data point. To see this, let $\bar Z_{n-1}$ denote the mean of a batch of $n-1$ observations.
The mean of the batch of $n$ observations obtained by adding the value $z$ to the initial batch is equal to
$$\bar Z_n = n^{-1} \Bigl( \sum_{i=1}^{n-1} Z_i + z \Bigr) = \Bigl( 1 - \frac{1}{n} \Bigr) \bar Z_{n-1} + \frac{1}{n}\, z,$$
that is, $\bar Z_n$ is a weighted average of $\bar Z_{n-1}$ and $z$. For any fixed $n$,
$$|\bar Z_n - \bar Z_{n-1}| = \frac{|z - \bar Z_{n-1}|}{n} \to \infty$$
as $|z| \to \infty$. Since a single outlier is enough to take $\bar Z_n$ arbitrarily far away from $\bar Z_{n-1}$, we say that the mean is not a robust measure of center. On the other hand, for any fixed finite $z$,
$$\bar Z_n - \bar Z_{n-1} = \frac{z - \bar Z_{n-1}}{n} \to 0$$
as $n \to \infty$, and so the effect of a single outlier vanishes as the size of the batch gets arbitrarily large.

The normalized difference
$$SC(z) = n(\bar Z_n - \bar Z_{n-1}) = z - \bar Z_{n-1},$$
viewed as a function of $z$, is called the sensitivity curve of the mean. The fact that this function is unbounded simply reflects the lack of robustness of the mean.

A closely related concept is J.W. Tukey's empirical influence function. Let $\bar Z$ denote the mean of a batch of data of size $n$, and let $\bar Z_{(i)}$ denote the mean of the batch of size $n-1$ obtained by deleting the $i$-th data point $Z_i$. It is easy to verify that
$$\bar Z - \bar Z_{(i)} = \frac{Z_i - \bar Z}{n-1}, \qquad i = 1, \ldots, n.$$
The empirical influence function of the mean is an $n$-vector with $i$-th element equal to $\bar Z - \bar Z_{(i)}$. An influential observation is one for which the difference $|\bar Z - \bar Z_{(i)}|$ is large or, equivalently, the residual $Z_i - \bar Z$ is large in absolute value.

To robustify the mean, let us sort the data in ascending order. The ordered data $Z_{(1)}, Z_{(2)}, \ldots, Z_{(n)}$, where $Z_{(1)} \le Z_{(2)} \le \cdots \le Z_{(n)}$, are called the set of order statistics of the batch. A reasonable measure of center is the (symmetric) $\alpha$-trimmed mean, defined as
$$\bar Z_\alpha = \frac{Z_{([n\alpha]+1)} + \cdots + Z_{(n-[n\alpha])}}{n - 2[n\alpha]}, \qquad 0 \le \alpha < .5,$$
where $[n\alpha]$ denotes the greatest integer less than or equal to $n\alpha$. Thus $\bar Z_\alpha$ is obtained by dropping the $[n\alpha]$ largest and $[n\alpha]$ smallest data points and then taking the average of the rest. The mean is the extreme case corresponding to $\alpha = 0$.
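The quantities discussed so far can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function names and the small batch of numbers are hypothetical, chosen only for illustration. It computes the mean, the sensitivity curve of the mean, the leave-one-out differences $\bar Z - \bar Z_{(i)}$, and the $\alpha$-trimmed mean as defined above.

```python
from math import floor


def mean(z):
    """Arithmetic mean of a batch given as a list of numbers."""
    return sum(z) / len(z)


def sensitivity_curve_mean(batch, z):
    """SC(z) = n * (mean of batch with z added - mean of batch), n = len(batch) + 1."""
    n = len(batch) + 1
    return n * (mean(batch + [z]) - mean(batch))


def empirical_influence_mean(batch):
    """List whose i-th element is mean(batch) - mean(batch with i-th point deleted)."""
    full = mean(batch)
    return [full - mean(batch[:i] + batch[i + 1:]) for i in range(len(batch))]


def trimmed_mean(z, alpha):
    """Symmetric alpha-trimmed mean: drop [n*alpha] points from each tail."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must satisfy 0 <= alpha < 0.5")
    n = len(z)
    k = floor(n * alpha)        # [n*alpha]; floating-point rounding can matter here
    kept = sorted(z)[k:n - k]   # Z_([n*alpha]+1), ..., Z_(n-[n*alpha])
    return sum(kept) / len(kept)


# Hypothetical batch of 9 points, then the same batch with one gross outlier.
batch = [2.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0, 15.0, 16.0]
contaminated = batch + [1000.0]

print("mean, clean batch:        ", mean(batch))
print("mean, contaminated batch: ", mean(contaminated))
print("10% trimmed mean, contam.:", trimmed_mean(contaminated, 0.1))
print("SC(1000) for the mean:    ", sensitivity_curve_mean(batch, 1000.0))
```

With these made-up numbers, the single outlier drags the mean far from the bulk of the data, whereas the 10% trimmed mean discards it together with the smallest point.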
To compare the robustness properties of the mean and an $\alpha$-trimmed mean, we introduce the concept of breakdown point. Let $T(Z)$ be a measure of center for a batch $Z$ of size $n$, and let $T(Z^*)$ be the same measure for a new batch $Z^*$ obtained by replacing any $m$ of the original data points by arbitrary values. Let
$$b(m; T, Z) = \sup_{Z^*} |T(Z^*) - T(Z)|,$$
where the supremum is taken over all possible $Z^*$. If $b(m; T, Z)$ is infinite, this means that $m$ outliers can have an arbitrarily large effect on $T$, which may be expressed by saying that $T$ "breaks down". Therefore, the breakdown point of $T$ is defined by
$$\epsilon(T, Z) = \min \Bigl\{ \frac{m}{n} : b(m; T, Z) = \infty \Bigr\}.$$
In other words, the breakdown point is the smallest fraction of contamination that can cause $T(Z^*)$ to take on values arbitrarily far from $T(Z)$. It is straightforward to verify that the breakdown point of the mean is equal to $1/n$, whereas the breakdown point of the $\alpha$-trimmed mean is equal to $([n\alpha] + 1)/n$.

The median may be viewed as the extreme case of an $\alpha$-trimmed mean corresponding to $\alpha \to .5$. When the number $n$ of data points in $Z$ is odd, the median $\tilde Z$ is unique and is equal to $Z_{([n+1]/2)}$. When $n$ is even, a median is any point in the interval $[Z_{(n/2)}, Z_{([n/2]+1)}]$. This lack of uniqueness is conventionally resolved by defining
$$\tilde Z = \begin{cases} Z_{([n+1]/2)}, & \text{if $n$ is odd,} \\ .5\,[Z_{(n/2)} + Z_{([n/2]+1)}], & \text{if $n$ is even.} \end{cases}$$
Notice that if $n$ is odd, the median exactly coincides with one of the observations. If $n$ is even, the median is the average of two adjacent order statistics. It is easy to verify that if $g$ is any increasing function and $X$ is such that $X_i = g(Z_i)$, then $\tilde X = g(\tilde Z)$. This holds exactly when $n$ is odd; when $n$ is even, $g(\tilde Z)$ is still a median of $X$, although it need not equal the conventional midpoint. The breakdown point of the median is equal to $1/2$ if $n$ is even, and is equal to $(1 + n^{-1})/2$ if $n$ is odd.

With little loss of generality, let $\tilde Z_{n-1}$ be the median of a batch of size $n-1$, where $n-1 = 2k$ is even. Thus, $\tilde Z_{n-1} = .5\,[Z_{(k)} + Z_{(k+1)}]$. The median of the batch of size $n$ obtained by adding the value $z$ to the previous batch is equal to
$$\tilde Z_n = \begin{cases} Z_{(k)}, & \text{if $z < Z_{(k)}$,} \\ z, & \text{if $Z_{(k)} \le z \le Z_{(k+1)}$,} \\ Z_{(k+1)}, & \text{if $z > Z_{(k+1)}$.} \end{cases}$$
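As an illustration of the two ideas above, here is a small Python sketch under stated assumptions. The helper names and the data are hypothetical. Since the supremum in the definition of $b(m; T, Z)$ cannot be computed directly, the sketch uses a single very large replacement value as a crude proxy for "arbitrary values", so it only suggests, and does not prove, the breakdown points quoted in the text. It also checks the three-case rule for the median after adding a point $z$ against Python's `statistics.median`.

```python
from statistics import mean, median


def median_after_adding(batch, z):
    """Median of batch + [z] for an even-sized batch, using the three-case rule."""
    if len(batch) % 2 != 0:
        raise ValueError("the rule is stated for an even batch size n - 1 = 2k")
    s = sorted(batch)
    k = len(s) // 2
    lo, hi = s[k - 1], s[k]     # Z_(k) and Z_(k+1) in the 1-based notation of the text
    if z < lo:
        return lo
    if z > hi:
        return hi
    return z


def smallest_breaking_fraction(stat, batch, big=1e12, threshold=1e6):
    """Smallest m/n for which replacing m points by `big` moves `stat` by more
    than `threshold`. A numerical proxy for the breakdown point, not a proof."""
    n = len(batch)
    s = sorted(batch)
    for m in range(1, n + 1):
        corrupted = s[:n - m] + [big] * m   # replace the m largest points
        if abs(stat(corrupted) - stat(batch)) > threshold:
            return m / n
    return None


batch = [3.0, 1.0, 8.0, 5.0, 13.0, 2.0]     # hypothetical batch, n - 1 = 6

# The three-case rule agrees with a direct recomputation of the median.
for z in (-1e9, 4.0, 1e9):
    assert median_after_adding(batch, z) == median(batch + [z])

print("mean breaks at fraction:  ", smallest_breaking_fraction(mean, batch))
print("median breaks at fraction:", smallest_breaking_fraction(median, batch))
```

For this even-sized batch the proxy reports $1/n$ for the mean and $1/2$ for the median, in line with the values stated above.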
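To compare the sensitivity curves of the mean and the median, consider the case when $\bar Z_{n-1} = \tilde Z_{n-1}$. Then
$$SC(z; \bar Z) = z - \bar Z_{n-1},$$
while
$$SC(z; \tilde Z) = \begin{cases} n\,(Z_{(k)} - \bar Z_{n-1}), & \text{if $z < Z_{(k)}$,} \\ n\,(z - \bar Z_{n-1}), & \text{if $Z_{(k)} \le z \le Z_{(k+1)}$,} \\ n\,(Z_{(k+1)} - \bar Z_{n-1}), & \text{if $z > Z_{(k+1)}$.} \end{cases}$$
Unlike the sensitivity curve of the mean, that of the median is bounded.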
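A short Python sketch can make the comparison concrete. It assumes the normalization $SC(z; T) = n\,(T_n - T_{n-1})$ used for the mean earlier in the text; the function name and the batch are hypothetical, and the batch is chosen so that its mean and median coincide.

```python
from statistics import mean, median


def sensitivity_curve(stat, batch, z):
    """SC(z; stat) = n * (stat(batch + [z]) - stat(batch)), with n = len(batch) + 1."""
    n = len(batch) + 1
    return n * (stat(batch + [z]) - stat(batch))


# Hypothetical batch of size n - 1 = 6 with mean = median = 5,
# so that Z_(k) = 4 and Z_(k+1) = 6.
batch = [1.0, 2.0, 4.0, 6.0, 7.0, 10.0]

for z in (-100.0, 4.5, 5.5, 100.0, 10000.0):
    print(z,
          sensitivity_curve(mean, batch, z),
          sensitivity_curve(median, batch, z))
```

In this example the curve of the mean grows without bound in $z$, while the curve of the median stays within $n\,(Z_{(k)} - \bar Z_{n-1})$ and $n\,(Z_{(k+1)} - \bar Z_{n-1})$.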
Instead of choosing a single measure of center, it is often more informative to compute and compare several measures. For example, comparing the mean and the median gives an indication about the presence of skewness in the data (skewness is another vague concept!). If the data are symmetric, then the mean and the median coincide. If the data are skewed to the right, then the mean is typically greater than the median. If the data are skewed to the left, then the median is typically greater than the mean.

1.2 MEASURES OF SPREAD

Two measures of spread (or scale) based on order statistics are the range
$$\text{range} = \max_i \{Z_i\} - \min_i \{Z_i\} = Z_{(n)} - Z_{(1)}$$
and the interquartile range
$$\text{IQR} = \text{upper quartile} - \text{lower quartile},$$
where the upper quartile is the median of the data greater than or equal to the median, and the lower quartile is the median of the data smaller than or equal to the median.

Two other common measures of spread are the mean squared deviation from the mean
$$\hat\sigma^2 = n^{-1} \sum_{i=1}^{n} (Z_i - \bar Z)^2,$$
or its square root $\hat\sigma$, called the standard deviation, and the mean absolute deviation from the mean
$$\tilde\sigma = n^{-1} \sum_{i=1}^{n} |Z_i - \bar Z|.$$
The first is just the mean of the squared deviations $(Z_i - \bar Z)^2$, while the second is the mean of the absolute deviations $|Z_i - \bar Z|$. Because of their mean-like nature, neither measure is robust. It is easily seen that
$$\hat\sigma^2 = n^{-1} \sum_{i=1}^{n} Z_i^2 - \bar Z^2.$$
Further, if $X$ is such that $X_i = a + b Z_i$, $b \ne 0$, then
$$\hat\sigma_X = |b|\, \hat\sigma_Z, \qquad \tilde\sigma_X = |b|\, \tilde\sigma_Z.$$
A highly robust estimate of spread is the median absolute deviation from the median
$$\text{MAD} = \operatorname{Med}_i |Z_i - \tilde Z|.$$

1.3 MEASURES OF SHAPE

One measure of center and one measure of spread are often all one needs to concisely summarize the data. Just a pair of summary statistics, however, does
not provide an accurate description of the data, in the sense that arbitrarily different batches of data may result in exactly the same description. J.W. Tukey suggested the use of a box-plot, a graphical procedure that combines a measure of location (the median) and a measure of spread (the IQR), shows the presence of possible outliers, and gives some indication about the shape of the distribution of the data in terms of their symmetry or skewness. Construction of a box-plot proceeds as follows:

1. Horizontal lines are drawn at the median and at the upper and lower quartiles, and are joined by vertical lines to produce the box.

2. A vertical line is drawn up from the upper quartile to the most extreme data point that is within a distance of $1.5 \times \text{IQR}$ from the upper quartile. A similarly defined vertical line is drawn down from the lower quartile. Short horizontal lines are added to mark the ends of these vertical lines.

3. Each data point beyond the ends of the vertical lines is marked with an asterisk or a dot.

Symmetry or asymmetry is revealed by the location of the median relative to the upper and lower quartiles.

If a large batch of data is available, one can study its shape in more detail. The main tool is the empirical distribution function (edf) $F_n(z)$, defined as the fraction of data points less than or equal to $z$. Let $1\{A\}$ denote the indicator function of the event $A$, that is,
$$1\{A\} = \begin{cases} 1, & \text{if $A$ occurs,} \\ 0, & \text{otherwise.} \end{cases}$$
Then we can simply write
$$F_n(z) = n^{-1} \sum_{i=1}^{n} 1\{Z_i \le z\}.$$
Notice that $F_n$ is a non-decreasing step function, bounded between 0 and 1, with jumps of height $1/n$ at each distinct point $Z_i$. If a data value is repeated $m$ times, the jump is equal to $m/n$. The edf summarizes all the information contained in a batch of data, except the order in which the observations enter the batch. Notice that in some cases, such as time series, time order may be important.

There exists a simple relationship between the edf and the set of order statistics. By the definition of order statistic, and provided there are no ties, the number of data points less than or equal to $Z_{(i)}$ is equal to $i$.
Thus
$$F_n(Z_{(i)}) = \frac{i}{n},$$
and we say that $Z_{(i)}$ is the $i/n$-quantile of the empirical distribution of $Z$.

Sometimes it is useful to compare two edf's by means of Q-Q plots. In a Q-Q plot, the quantiles of one batch of data are plotted against those of another.
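The edf and the points of a Q-Q plot are easy to compute directly from their definitions. The sketch below is illustrative only: `edf` and `qq_pairs` are hypothetical helper names, the data are made up, and `qq_pairs` assumes two batches of equal size so that order statistics can be paired one to one.

```python
def edf(batch):
    """Return F_n as a function: the fraction of data points <= z."""
    data = list(batch)
    n = len(data)

    def F(z):
        return sum(1 for v in data if v <= z) / n

    return F


def qq_pairs(x, z):
    """Points (Z_(i), X_(i)) of a Q-Q plot for two batches of equal size."""
    if len(x) != len(z):
        raise ValueError("this sketch assumes batches of equal size")
    return list(zip(sorted(z), sorted(x)))


# Hypothetical batch without ties: F_n(Z_(i)) = i/n for every i.
batch = [2.5, 0.3, 4.0, 1.2]
F = edf(batch)
for i, z_i in enumerate(sorted(batch), start=1):
    assert F(z_i) == i / len(batch)

# With a value repeated m = 2 times, the jump at that value is m/n.
F_ties = edf([0.3, 1.2, 1.2, 2.5, 4.0])
print(F_ties(1.0), F_ties(1.2))

print(qq_pairs([3, 1, 2], [10, 30, 20]))
```

Evaluating the edf this way costs one pass over the data per call; for large batches one would usually sort once and use binary search instead.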
To interpret a Q-Q plot, the following result is useful. If $X$ and $Z$ are batches of data such that $X_i = a + b Z_i$, $0 < b < \infty$, then $X_{(i)} = a + b Z_{(i)}$. This implies that a Q-Q plot of $X$ against $Z$ is a straight line with slope equal to $b$ and intercept equal to $a$.

Instead of working with the edf, it is sometimes convenient to work with an equivalent representation, namely the empirical survival function
$$S_n(z) = n^{-1} \sum_{i=1}^{n} 1\{Z_i > z\} = 1 - F_n(z).$$
This is just the fraction of data points greater than $z$. Clearly, $S_n(Z_{(i)}) = (n - i)/n$. The empirical survival function is often used in the case of data on time until failure or death, such as individual lifetimes or unemployment duration data.

An alternative way of displaying the shape of a batch of data is by means of a histogram. To construct a histogram, partition the range of the data into intervals or bins of a certain (possibly unequal) bin width. A histogram is then obtained by plotting the fraction of observations in each bin divided by the bin width. Thus, if the bin width is constant and equal to $\delta$, the height $h_n(z; \delta)$ of a histogram at a point $z$ is equal to the number of data points in the bin containing $z$ divided by $n\delta$. Thus, $\delta\, h_n(z; \delta)$ is just the relative frequency of data in the bin containing $z$. If there are $m$ bins of equal size, then
$$\delta = \frac{Z_{(n)} - Z_{(1)}}{m}.$$
Viewed as a function, $h_n(\cdot\,; \delta)$ is non-negative and integrates up to one, that is, $h_n(\cdot\,; \delta) \ge 0$ and $\int h_n(z; \delta)\, dz = 1$.

A crucial problem in constructing a histogram is the choice of the number of bins. Too many bins (or, equivalently if the bin width is constant, too small a bin width) make a histogram look too ragged; too few bins (too large a bin width) make the histogram look oversmoothed. If data are not uniformly distributed, it may be useful to let the bin width vary with the local density of the data. In this case, wider bins will be chosen where the data are more sparse, and narrower bins where the data are more dense.

REFERENCES

Hoaglin D.C., Mosteller F. and Tukey J.W. (1983) Understanding Robust and Exploratory Data Analysis, Wiley, New York.

Mosteller F. and Tukey J.W.
(1977) Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley, Reading, MA.

Tukey J.W. (1977) Exploratory Data Analysis, Addison-Wesley, Reading, MA.