Statistical Geophysics WS 2008/09
7..2008
Christian Heumann and Helmut Küchenhoff

Measures for location and dispersion of a sample

In the following: a variable X and a sample of size n with values $x_1, x_2, \ldots, x_n$. We consider summary measures for the location or dispersion of empirical distributions.

Measures for location

Measures for location or central tendency

Definition: Mode
The most frequently occurring value or category of X.
Example: The mode of (2, 2, 3, 3, 4, 5, 5, 5) is 5.
The mode may not be uniquely defined: (2, 2, 2, 3, 4, 4, 4) has two modes, 2 and 4.

Definition: Median
Order the sample $x_1, x_2, \ldots, x_n$ to get the order statistics
$$x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)}.$$
The median divides the ordered data values such that at least 50% of the data values are lower than or equal to the median and at least 50% are greater than or equal to the median.
Calculate the median $x_{0.5}$ as
$$x_{0.5} = \begin{cases} x_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left(x_{(n/2)} + x_{(n/2+1)}\right) & \text{if } n \text{ is even} \end{cases}$$

Definition: Arithmetic mean or average of a sample
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The mean of a population is often denoted by $\mu$, with
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$
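The three location measures defined above can be checked on the example sample with a few lines of Python. This is only a minimal sketch: the function names `modes`, `median` and `mean` are chosen here for illustration and are not part of the lecture material. `modes` returns all most frequent values, so a non-unique mode becomes visible.

```python
from collections import Counter


def modes(sample):
    """Return all most frequent values (sorted), so ties are visible."""
    counts = Counter(sample)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def median(sample):
    """Median via the order statistics, following the odd/even rule above."""
    xs = sorted(sample)
    n = len(xs)
    if n % 2 == 1:
        # x_((n+1)/2) in 1-based notation is index (n+1)/2 - 1 in Python
        return xs[(n + 1) // 2 - 1]
    # average of x_(n/2) and x_(n/2+1)
    return 0.5 * (xs[n // 2 - 1] + xs[n // 2])


def mean(sample):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(sample) / len(sample)


data = [2, 2, 3, 3, 4, 5, 5, 5]
print(modes(data))                    # [5]
print(modes([2, 2, 2, 3, 4, 4, 4]))   # [2, 4]
print(median(data))                   # 3.5
print(mean(data))                     # 3.625
```

For real work the standard library module `statistics` offers `mean`, `median` and `multimode`; the hand-written versions are shown only to mirror the definitions.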
Other measures of location: geometric mean, harmonic mean, weighted mean (weighted average), truncated or trimmed mean, Winsorized mean.

Definition: Quantile
Generalization of the median, for $\alpha \in [0, 1]$.
The α-quantile $x_\alpha$ divides the ordered data values such that at least $\alpha \cdot 100\%$ of the values are lower than or equal to the α-quantile and at least $(1-\alpha) \cdot 100\%$ of the values are greater than or equal to the α-quantile.
Calculation (usually software dependent; here the book version):
$$x_\alpha = \begin{cases} x_{(k)} & \text{if } n\alpha \text{ is not an integer, where } k \text{ is the smallest integer} > n\alpha \\ \frac{1}{2}\left(x_{(n\alpha)} + x_{(n\alpha+1)}\right) & \text{if } n\alpha \text{ is an integer} \end{cases}$$

Special quantiles
$\alpha \in \{0.1, 0.2, 0.3, \ldots, 0.9\}$: deciles
$\alpha \in \{0.25, 0.75\}$: first and third quartile
$\alpha = 0$: minimum
$\alpha = 1$: maximum

1.2 Measures for dispersion

Definition: Variance of a sample
The variance for a sample is defined as
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{or} \quad s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
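The book version of the α-quantile and the two variance formulas above can be sketched as follows. The names `quantile_book` and `variance` and the parameter `ddof` (borrowed from the NumPy naming convention) are assumptions made for this illustration. The cases α = 0 and α = 1 are mapped to the minimum and maximum, and a small tolerance is used to decide whether nα is an integer; both are implementation choices, not part of the definition.

```python
import math


def quantile_book(sample, alpha):
    """alpha-quantile following the 'book version' rule quoted above."""
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    xs = sorted(sample)
    n = len(xs)
    na = n * alpha
    m = round(na)
    # Treat n*alpha as an integer up to a small floating-point tolerance.
    if abs(na - m) < 1e-9 and 0 < m < n:
        return 0.5 * (xs[m - 1] + xs[m])   # average of x_(na) and x_(na+1)
    # Otherwise take x_(k), k the smallest integer > n*alpha (capped at n,
    # which maps alpha = 1 to the maximum; alpha = 0 gives the minimum).
    k = min(n, math.floor(na) + 1)
    return xs[k - 1]


def variance(sample, ddof=1):
    """Sample variance with divisor n - ddof (ddof=1: s^2, ddof=0: s_n^2)."""
    n = len(sample)
    xbar = sum(sample) / n
    return sum((x - xbar) ** 2 for x in sample) / (n - ddof)


data = [2, 2, 3, 3, 4, 5, 5, 5]
print(quantile_book(data, 0.25))   # 2.5
print(quantile_book(data, 0.5))    # 3.5, agrees with the median
print(quantile_book(data, 0.75))   # 5.0
print(variance(data))              # s^2, divisor n - 1
print(variance(data, ddof=0))      # 1.484375 (s_n^2, divisor n)
```

Statistical software often uses other interpolation rules, so quantiles computed elsewhere may differ slightly from this version.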
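Loss of one degree of freedom in $s^2$ (divisor $n-1$), since
$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$
$\bar{x}$ minimizes the average of the squared deviations.
The standard deviation s is then
$$s = \sqrt{s^2}$$
Note: s is not the same as the standard error of the mean (SEM).
The variance of a population is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2 \quad \text{with} \quad \mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Definition: Variance decomposition
Consider k groups $(x_{11}, x_{21}, \ldots, x_{n_1,1}), \ldots, (x_{1k}, x_{2k}, \ldots, x_{n_k,k})$ with
$$\bar{x}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} x_{ij}, \quad j = 1, \ldots, k$$
and
$$s_j^2 = \frac{1}{n_j}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2, \quad j = 1, \ldots, k$$
Then
$$s_n^2 = \frac{1}{n}\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2 + \frac{1}{n}\sum_{j=1}^{k} n_j s_j^2$$
with
$$n = \sum_{j=1}^{k} n_j \quad \text{and} \quad \bar{x} = \frac{1}{n}\sum_{j=1}^{k} n_j \bar{x}_j$$

Definition: Median absolute deviation
The median absolute deviation (MAD) is
$$\text{MAD} = \frac{1}{n}\sum_{i=1}^{n} |x_i - x_{0.5}|$$
$x_{0.5}$ minimizes the average of the absolute deviations.

Definition: Range
The range is
$$\text{Range} = x_{(n)} - x_{(1)}$$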
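A small numerical check can make the variance decomposition concrete. The sketch below splits a sample into three hypothetical groups (the grouping is invented for illustration) and compares the total variance with divisor n against the sum of the between-group and within-group parts. It also evaluates the MAD in the form given above, that is the average absolute deviation from the median, and the range. Note that many software packages instead define MAD as the median of the absolute deviations, so results may differ.

```python
import statistics


def mean(xs):
    return sum(xs) / len(xs)


def var_n(xs):
    """Variance with divisor n (the s_n^2 form)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def mad_about_median(xs):
    """Average absolute deviation from the median, as defined above."""
    med = statistics.median(xs)
    return sum(abs(x - med) for x in xs) / len(xs)


def value_range(xs):
    """Largest minus smallest observation."""
    return max(xs) - min(xs)


# Hypothetical grouping of the sample (2, 2, 3, 3, 4, 5, 5, 5)
groups = [[2, 2, 3], [3, 4], [5, 5, 5]]
pooled = [x for g in groups for x in g]
n = len(pooled)
xbar = mean(pooled)

# Between-group part: weighted squared distances of group means to xbar
between = sum(len(g) * (mean(g) - xbar) ** 2 for g in groups) / n
# Within-group part: weighted average of the group variances
within = sum(len(g) * var_n(g) for g in groups) / n

print(var_n(pooled), between + within)   # both sides of the decomposition
print(mad_about_median(pooled))          # 1.125
print(value_range(pooled))               # 3
```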
Definition: Interquartile range (IQR)
The IQR is
$$\text{IQR} = x_{0.75} - x_{0.25}$$

Definition: Coefficient of variation
$$v = \frac{s}{\bar{x}}$$
Assumption: X has positive values.

1.3 Measure for skewness

Definition: Skewness
[Figure 1: Distributions which are symmetric, negatively skewed and positively skewed]
The skewness of a sample is
$$g = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}\right)^3}$$

1.4 Graphical display, five point summary of a sample

Five point summary: minimum, first quartile, median, third quartile, maximum.
The simple version of a boxplot displays these measures. The extended version is given on the next slide.
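The measures defined above (IQR, coefficient of variation, skewness and the five point summary) can be sketched in Python as follows. All function names are chosen for this illustration. The quantile helper repeats the book rule so that the block is self-contained, and the coefficient of variation assumes that s is the standard deviation with divisor n − 1.

```python
import math
import statistics


def quantile_book(sample, alpha):
    """Book-version alpha-quantile (helper, see the quantile definition)."""
    xs = sorted(sample)
    n = len(xs)
    na = n * alpha
    m = round(na)
    if abs(na - m) < 1e-9 and 0 < m < n:
        return 0.5 * (xs[m - 1] + xs[m])
    return xs[min(n, math.floor(na) + 1) - 1]


def five_point_summary(sample):
    """Minimum, first quartile, median, third quartile, maximum."""
    return tuple(quantile_book(sample, a) for a in (0, 0.25, 0.5, 0.75, 1))


def iqr(sample):
    return quantile_book(sample, 0.75) - quantile_book(sample, 0.25)


def coefficient_of_variation(sample):
    """v = s / mean; only meaningful for positive-valued data."""
    return statistics.stdev(sample) / statistics.mean(sample)


def skewness(sample):
    """Third central moment divided by the cubed standard deviation (divisor n)."""
    n = len(sample)
    xbar = sum(sample) / n
    m2 = sum((x - xbar) ** 2 for x in sample) / n
    m3 = sum((x - xbar) ** 3 for x in sample) / n
    return m3 / math.sqrt(m2) ** 3


data = [2, 2, 3, 3, 4, 5, 5, 5]
print(five_point_summary(data))        # (2, 2.5, 3.5, 5.0, 5)
print(iqr(data))                       # 2.5
print(coefficient_of_variation(data))
print(skewness(data))                  # slightly negative for this sample
print(skewness([1, 1, 1, 10]))         # positive: long right tail
```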
Boxplot, extended version

Calculate $x_{0.25}$, $x_{0.5}$, $x_{0.75}$ and the IQR.
Draw a box bounded by $x_{0.25}$ and $x_{0.75}$, and mark $x_{0.5}$ with a line.
Any observation which lies more than 1.5 IQR below the first quartile or more than 1.5 IQR above the third quartile is considered an outlier.
Indicate the smallest value that is not an outlier by connecting it to the box with a line or "whisker"; do the same for the largest value that is not an outlier.
Indicate outliers by open and closed dots (or stars).
"Extreme" outliers, that is, those which lie more than 3 IQR below the first quartile or above the third quartile, are indicated by a closed dot or star.
"Mild" outliers, that is, those observations which lie more than 1.5 IQR from the first or third quartile but are not extreme outliers, are indicated by an open dot.

[Figure: Design of a boxplot. Labelled elements: extreme outlier (*), (mild) outlier (o), whisker, third quartile, median, first quartile, minimal value which is no outlier]

[Figure: Boxplot of the magnitudes data]
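The rules of the extended boxplot can be turned into a short classification routine. The following is a sketch under stated assumptions: the function name `boxplot_stats`, the returned dictionary keys and the example data are all invented for illustration, and the quartiles use the book-version quantile rule (plotting software may use a different rule and therefore slightly different fences).

```python
import math


def quantile_book(sample, alpha):
    """Book-version alpha-quantile (helper, repeated to be self-contained)."""
    xs = sorted(sample)
    n = len(xs)
    na = n * alpha
    m = round(na)
    if abs(na - m) < 1e-9 and 0 < m < n:
        return 0.5 * (xs[m - 1] + xs[m])
    return xs[min(n, math.floor(na) + 1) - 1]


def boxplot_stats(sample):
    """Box, whisker ends and mild/extreme outliers of the extended boxplot."""
    q1 = quantile_book(sample, 0.25)
    med = quantile_book(sample, 0.5)
    q3 = quantile_book(sample, 0.75)
    iqr = q3 - q1

    def distance_outside_box(x):
        # How far x lies below the first or above the third quartile
        return max(q1 - x, x - q3, 0)

    extreme = sorted(x for x in sample if distance_outside_box(x) > 3 * iqr)
    mild = sorted(x for x in sample
                  if 1.5 * iqr < distance_outside_box(x) <= 3 * iqr)
    inside = [x for x in sample if distance_outside_box(x) <= 1.5 * iqr]
    return {
        "box": (q1, med, q3),
        "whiskers": (min(inside), max(inside)),  # extreme non-outliers
        "mild": mild,        # drawn as open dots
        "extreme": extreme,  # drawn as closed dots or stars
    }


# Hypothetical data with one mild and one extreme outlier
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 18, 30])
print(stats["box"])        # (3, 5.5, 8)
print(stats["whiskers"])   # (1, 8)
print(stats["mild"])       # [18]
print(stats["extreme"])    # [30]
```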
[Figure: Boxplots of the noise data]

Boxplots: some notes

Boxplots are useful to compare distributions (e.g. for different groups).
Boxplots are more useful than error bars (e.g. $\bar{x} \pm 2$ SEM).
Boxplots give a hint for the shape of the distribution (symmetric or not).
Multiple modes cannot be detected.
See http://en.wikipedia.org/wiki/box_plot for alternative forms.
Boxplots need no tuning values, unlike e.g. the number of bins in a histogram or the bandwidth in a kernel density estimate.