Are our data symmetric?


Statistical Methods in Medical Research 2003; 12: 505-513

Sumithra J Mandrekar and Jayawant N Mandrekar
Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA

Skewness indicates a lack of symmetry in a distribution. Knowing the symmetry of the underlying data is essential for parametric analysis, fitting distributions or doing transformations to the data. The coefficient of skewness is the commonly used measure to identify a lack of symmetry in the underlying data, although graphical procedures can also be effective. We discuss three different methods to assess skewness: the traditional coefficient of skewness index, the skewness index based on the L-moments discussed by Hosking and the asymptotic test of symmetry developed by Randles et al. With this work, we provide easy-to-implement S-PLUS functions as well as discuss the advantages and shortcomings of each technique.

1 Introduction

The first step in any statistical analysis includes summarizing the characteristics of the underlying data. All standard statistical packages routinely provide summary statistics information, and this often includes a sample skewness score, which is a measure of symmetry. Symmetry is a rather complex property of probability distributions and it is difficult to identify deviations from it in a small number of observations. Broadly speaking, a dataset or a distribution is said to be symmetric if it looks the same to the right and left of the center point. One of the numerous reasons for checking symmetry in a given dataset is that many statistical tests rely strongly on the assumption of normality, which in turn relies on symmetry. Thus, a skewness measure can provide valuable information on issues such as data transformation, outlier detection, distribution fitting, and so on, so as to ensure that an appropriate analysis procedure (parametric versus nonparametric) is employed.
In our paper we compare three different methods used to assess skewness: the traditional coefficient of skewness index, the skewness index based on the L-moments discussed by Hosking [1], and the asymptotic distribution-free test of symmetry (symmetry test) developed by Randles et al. [2] Royston [3] has demonstrated that the L-moments-based skewness measure has no serious drawbacks in practical data applications, whereas the traditional coefficient of skewness index suffers from several theoretical and practical disadvantages. Further, using a Monte Carlo study, Randles et al. [2] showed that the nonparametric test of symmetry is superior to the test based on the sample skewness index. In the past, each of these indices has been explored individually. We aim to compare these three competitors on their complexity, accuracy and accessibility, and to offer practical guidelines using real-life and simulated datasets.

Address for correspondence: Sumithra J Mandrekar, Research Associate (Lead Statistician), Cancer Center Statistics, Kahler 1A, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA. mandrekar.sumithra@mayo.edu

2 Skewness indices

In this section we will give a brief description of the three different procedures used to compute skewness. We only present the formulae necessary to compute these indices/test statistics and the readers are referred to Hosking [1] and Randles et al. [2] for further details about the theoretical background.

Let $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ be the ordered random sample of size $n$ from a distribution of the random variable $X$ with mean $\mu$ and variance $\sigma^2$. The coefficient of skewness is defined as

$$S_1 = \frac{m_3}{(m_2)^{3/2}}, \quad \text{where} \quad m_r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^r}{n}.$$

Here, $m_r$ is the $r$th sample moment about the sample mean. For symmetrical distributions, $S_1$ has expectation 0, that is, when the data are symmetric, the sample skewness coefficient is near zero. If $S_1 > 0$, then the distribution is asymmetric with a positive skew and if $S_1 < 0$, then the distribution is asymmetric with a negative skew. The larger the absolute value of $S_1$, the more asymmetric is the distribution. (See Gupta [4] for a test based on this sample skewness coefficient.)

The estimates of the sample L-skewness are given by [1]

$$S_2 = \frac{l_3}{l_2},$$

where

$$l_2 = 2w_2 - \bar{x}, \qquad l_3 = 6w_3 - 6w_2 + \bar{x},$$

$$w_2 = \frac{\sum_{i=2}^{n} (i-1)\, x_{(i)}}{n(n-1)}, \qquad w_3 = \frac{\sum_{i=3}^{n} (i-1)(i-2)\, x_{(i)}}{n(n-1)(n-2)},$$

and $-1 < S_2 < 1$.

An alternative L-skewness index, $S_2' = (1 + S_2)/(1 - S_2)$, has also been defined by Hosking [1] and its properties have been discussed [1,3]. The index $S_2'$ is easier to interpret than $S_2$, as it is the ratio of the length of the upper tail to the lower tail in samples of size 3. $S_2'$ therefore ranges from 0 to $\infty$, and values of 1, >1 and <1 indicate symmetric, positively skewed and negatively skewed distributions. In the case of $S_2$, a value of 0 indicates symmetry, $-1 < S_2 < 0$ indicates a negatively skewed distribution and $0 < S_2 < 1$ indicates a positively skewed distribution.
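The two summary indices above can be sketched in a few lines of Python. This is only an illustration of the formulae for $S_1$, $S_2$ and $S_2'$, not the authors' S-PLUS functions; the function names (`sample_skewness`, `l_skewness`, `l_skewness_ratio`) and the toy sample are assumptions made for this example.

```python
def sample_skewness(x):
    """Traditional coefficient of skewness S1 = m3 / m2**(3/2)."""
    n = len(x)
    mean = sum(x) / n
    # m_r is the r-th sample moment about the sample mean (divisor n)
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def l_skewness(x):
    """Sample L-skewness S2 = l3 / l2 from the ordered sample."""
    xs = sorted(x)  # the weights w2, w3 use the order statistics x_(i)
    n = len(xs)
    mean = sum(xs) / n
    # i is the 1-based rank; terms with i = 1 (and i = 2 for w3) vanish
    w2 = sum((i - 1) * v for i, v in enumerate(xs, start=1)) / (n * (n - 1))
    w3 = sum((i - 1) * (i - 2) * v for i, v in enumerate(xs, start=1)) / (
        n * (n - 1) * (n - 2)
    )
    l2 = 2 * w2 - mean
    l3 = 6 * w3 - 6 * w2 + mean
    return l3 / l2


def l_skewness_ratio(s2):
    """Alternative index S2' = (1 + S2) / (1 - S2); 1 indicates symmetry."""
    return (1 + s2) / (1 - s2)


if __name__ == "__main__":
    toy = [1, 2, 3, 4, 10]  # hypothetical right-skewed sample
    s2 = l_skewness(toy)
    print(sample_skewness(toy), s2, l_skewness_ratio(s2))
```

For the toy sample, working through the formulae by hand gives $w_2 = 3$, $w_3 = 2.5$, $l_2 = 2$ and $l_3 = 1$, so $S_2 = 0.5$ and $S_2' = 3$; both indices, like $S_1$, flag a positive skew.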
Both $S_1$ and $S_2$ ($S_2'$) are measures of skewness (whose numerical values quantify symmetry or asymmetry as the case may be), with no widely used test statistics associated with them. The distribution-free test of symmetry, however, tests whether a univariate distribution is symmetric about some unknown value against a broad class of asymmetric distribution alternatives. To discuss the test statistic proposed by Randles et al. [2] based on the $n$ unordered observations of $X$, first consider every triple $(X_i, X_j, X_k)$, $1 \le i < j < k \le n$ (all the

notations used in this paper are consistent with the discussion given in Wolfe and Hollander [5]). A set of three distinct observations is called a right triple when the middle observation is closer to the smallest than to the largest (and hence is skewed to the right) and is called a left triple when the middle observation is closer to the largest than to the smallest (and hence is skewed to the left). Define

$$f^*(X_i, X_j, X_k) = \operatorname{sign}(X_i + X_j - 2X_k) + \operatorname{sign}(X_i + X_k - 2X_j) + \operatorname{sign}(X_j + X_k - 2X_i),$$

where $\operatorname{sign}(y) = -1, 1, 0$ if $y$ is less than, greater than or equal to 0, respectively. If $f^*(X_i, X_j, X_k) = 1$, the triple is a right triple; it is a left triple if the value is $-1$, and it is neither a left nor a right triple if the value is 0. Note that the test statistic is well defined when zeros occur in the computation of $X_i + X_j - 2X_k$ for any $i, j, k$.

We then compute the following for the entire dataset:

$$T = [\text{number of right triples}] - [\text{number of left triples}];$$

for each fixed $t = 1, \ldots, n$, let

$$B_t = [\text{number of right triples with } X_t] - [\text{number of left triples with } X_t];$$

and for each fixed pair $(s, t)$, $1 \le s < t \le n$, let

$$B_{s,t} = [\text{number of right triples with } X_s, X_t] - [\text{number of left triples with } X_s, X_t].$$

The test statistic is based on the above combinations of the number of right and left triples in the entire dataset and, when $n$ is large, its distribution is well approximated by the normal distribution. In particular, the test statistic $S_3$ is given by

$$S_3 = \frac{T}{\hat{\sigma}},$$

where

$$\hat{\sigma}^2 = \frac{(n-3)(n-4)}{(n-1)(n-2)} \sum_{t=1}^{n} B_t^2 + \frac{n-3}{n-4} \sum_{s=1}^{n-1} \sum_{t=s+1}^{n} B_{s,t}^2 + \frac{n(n-1)(n-2)}{6} - \left[ 1 - \frac{(n-3)(n-4)(n-5)}{n(n-1)(n-2)} \right] T^2.$$

From a sample of size $n$, there are $\binom{n}{3} = n(n-1)(n-2)/6$ distinct triples, and if the null hypothesis of symmetry holds, then we expect half of them to be right triples and half of them to be left triples.
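A direct, unoptimized way to compute $T$, $B_t$, $B_{s,t}$ and $S_3$ is to enumerate every triple. The Python sketch below follows the definitions above; it is not the authors' S-PLUS implementation, the function name `triples_test` is hypothetical, and the two-sided p-value is an assumption based on the large-sample normal approximation.

```python
import math
from itertools import combinations


def sign(y):
    """Return -1, 1 or 0 when y is negative, positive or zero."""
    return (y > 0) - (y < 0)


def triples_test(x):
    """Return (T, S3, two-sided p-value) for the triples test of symmetry."""
    n = len(x)
    if n < 5:
        raise ValueError("the variance formula needs n >= 5")
    T = 0
    B = [0] * n   # B_t: right minus left triples involving observation t
    B_pair = {}   # B_{s,t}: right minus left triples involving s and t
    for i, j, k in combinations(range(n), 3):
        f = (sign(x[i] + x[j] - 2 * x[k])
             + sign(x[i] + x[k] - 2 * x[j])
             + sign(x[j] + x[k] - 2 * x[i]))  # 1 right, -1 left, 0 neither
        if f == 0:
            continue
        T += f
        for t in (i, j, k):
            B[t] += f
        for pair in ((i, j), (i, k), (j, k)):
            B_pair[pair] = B_pair.get(pair, 0) + f
    sum_bt = sum(b * b for b in B)
    sum_bst = sum(b * b for b in B_pair.values())
    c = n * (n - 1) * (n - 2)
    var = ((n - 3) * (n - 4) / ((n - 1) * (n - 2)) * sum_bt
           + (n - 3) / (n - 4) * sum_bst
           + c / 6
           - (1 - (n - 3) * (n - 4) * (n - 5) / c) * T * T)
    s3 = T / math.sqrt(var)
    # Two-sided p-value 2 * (1 - Phi(|S3|)) from the normal approximation
    p_value = math.erfc(abs(s3) / math.sqrt(2))
    return T, s3, p_value


if __name__ == "__main__":
    print(triples_test([1, 2, 4, 8, 16, 32]))  # hypothetical toy sample
```

In the toy sample every one of the 20 triples is a right triple, so $T = 20$, each $B_t = 10$, each $B_{s,t} = 4$, and the variance formula gives $\hat{\sigma}^2 = 180$, hence $S_3 \approx 1.49$. The loop visits all $\binom{n}{3}$ triples, which is why the cost grows roughly as $n^3$.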
Roughly speaking, any substantial deviation in either direction (either more right triples or more left triples) is indicative of asymmetry in the underlying population. For large $n$ and a specified level of significance $\alpha$, the null hypothesis of symmetry is rejected in favor of the general alternative of asymmetry if $|S_3| \ge z_{\alpha/2}$, where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard normal distribution. Appropriate one-sided tests can be done to check for specific deviations (right or left skewness) from symmetry (see Wolfe and Hollander [5] for further details). Although computationally intensive, the results from this test are accurate even for small sample sizes and display good power in detecting asymmetric distributions as

compared to sample skewness measures [2]. We have written functions for computing this test statistic as well as the L-skewness in S-PLUS (version 6.0, Release 1).

3 Some considerations

3.1 Accuracy and interpretability

The sample skewness coefficient is sensitive to even small changes in the tail of the distribution, whereas L-skewness and symmetry tests are sensitive to changes in the shape of the main portion (in the middle as opposed to the tail). The sample skewness coefficient is susceptible to moderate outliers in the sample since cubes of extreme deviations are highly influential. Royston [3] further demonstrated that the sample skewness coefficient is a poor estimator of skewness in skew distributions as compared to the L-skewness, which is more reliable. The symmetry test is not effective at identifying asymmetry when sample sizes are small (<20) [2,6]. Both the symmetry test and the L-skewness can provide a measure of relative skewness, whereas the sample skewness coefficient is less interpretable in terms of the distribution features.

3.2 Complexity

The sample skewness coefficient is the easiest to compute, followed by the L-skewness and the symmetry test statistics. The L-skewness requires the data to be sorted in an increasing order, whereas the symmetry test requires considering every triple of observations for computing the test statistic. Although the symmetry test displays good power, when the sample size is large, it is computationally intensive.

3.3 Accessibility

The sample skewness coefficient is part of many standard statistical packages and hence is accessible and also efficient in terms of time required for computation. Our readily available S-PLUS functions make it feasible to compute the L-skewness and perform the symmetry test. There are, however, trade-offs between computation time and power for the symmetry test, when the sample size is large.
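To put the computational trade-off in rough numbers, the short Python sketch below counts how many triples a brute-force symmetry test would have to examine. The helper name `triple_count` and the chosen sample sizes are assumptions for illustration only.

```python
import math


def triple_count(n):
    """Number of distinct triples in a sample of size n: n(n-1)(n-2)/6."""
    return math.comb(n, 3)


if __name__ == "__main__":
    for n in (10, 100, 250):
        print(n, triple_count(n))
```

A sample of size 10 has 120 triples, whereas a sample of size 250 already has 2,573,000, so the work grows roughly with the cube of the sample size while the sample skewness coefficient and the L-skewness need only a single pass or a sort.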
4 An illustration

Between January 1974 and May 1984, Mayo Clinic conducted a double-blinded randomized trial in primary biliary cirrhosis of the liver (PBC), comparing the drug D-penicillamine (DPCA) with a placebo [7]. There were 424 patients seen at the Clinic who met the eligibility criteria while the trial was open for patient registration: 312 cases had complete data, 112 cases did not participate in the trial but consented to have basic measurements recorded, and six were lost to follow-up. This dataset has been used for several purposes: estimating survival distribution, testing for differences between the drug and placebo groups, and estimating covariate effects via a regression model. The variables of interest are serum cholesterol (mg/dl), albumin (gm/dl), and triglycerides (mg/dl), for which we check symmetry using all three approaches. As a first step, an exploratory analysis (Figure 1) revealed asymmetric distributions for all the

variables except albumin (which is symmetric once a few outlying values are removed). The results from the formal computations of the skewness indices (Table 1) support the visual observation that all of the variables are positively skewed except albumin, which has a slight negative skew. The removal of a few small values ($\le$ 2.6 gm/dl) makes the distribution of albumin symmetric (Table 1 and Figure 1).

Figure 1. Histograms of serum cholesterol, albumin and triglycerides from the primary biliary cirrhosis (PBC) study.

Table 1. Skewness indices for the variables from the primary biliary cirrhosis (PBC) study

Variable             S_1     S_2 (S_2')     S_3 (stat/p-value)
Serum cholesterol    3.39    0.43 (2.49)    ... (<0.001)
Triglycerides        2.51    0.28 (1.77)    9.13 (<0.001)
Albumin              ...     ... (0.86)     3.02 (0.002)
Albumin (>2.6)       ...     ... (0.98)     0.59 (0.56)

p-values reported for S_3 statistics are two-sided.

One major issue with using $S_1$, $S_2$ ($S_2'$) or $S_3$ as skewness indicators is quantification. Before we proceed, it is important to note that we cannot compare across the measures as the ranges are different and hence all comparisons are made within a skewness index, across the variables. The first thing to note is that all the variables (except albumin) are positively skewed, irrespective of which measure is used as the deviations from

symmetry are caused by changes in the tails as well as the middle of the distribution. For instance, $S_1$ is 3.39 units above zero (where zero indicates symmetry) in the case of serum cholesterol, and 2.51 units above zero in the case of triglycerides. For serum cholesterol, $S_2$ ($S_2'$) is 0.43 units above zero (1.49 units above one), where zero indicates symmetry for $S_2$ and one indicates symmetry for $S_2'$; for triglycerides it is 0.28 units above zero (0.77 units above one). If we go by the magnitude of the sample skewness coefficient and/or the L-skewness, would it mean that the distribution of serum cholesterol is more positively skewed than that of triglycerides and, if so, by how much? The same holds for $S_3$, where the p-values for both serum cholesterol and triglycerides are very small, thus indicating the presence of asymmetry in the distribution. Under these circumstances, the L-skewness index, $S_2'$, can provide some indication of the relative skewness, as discussed below.

Let us explore the triglyceride variable a little further. In addition to the histogram of the raw data (Figure 1), a quantile-quantile (Q-Q) plot of triglycerides on the linear and logarithmic scale (Figure 2) shows that a log-transformation makes the distribution of triglycerides normal (symmetric). Computation of the skewness indices ($S_1 = 0.35$; $S_2 = 0.05$ ($S_2' = 1.10$); $S_3 = 1.57$, p-value = 0.06) on the log-transformed variable provides evidence of considerable reduction in the asymmetry.

Figure 2. Q-Q plots of triglyceride variable on linear and logarithmic scales.

In terms of interpretability, the L-skewness index, $S_2'$, of the untransformed triglyceride variable

suggests that the upper tail (in samples of size 3) is about two times as long as the lower tail, whereas in the log-transformed variable the upper tail is only about 10% longer.

5 Are these necessary and sufficient?

As a simple illustration, we generate random samples of size 10, 30, 75, 100 and 250 from a standard normal distribution (mean = 0, variance = 1), and compute the skewness indices based on all the three methods (Table 2).

Table 2. Skewness indices for the simulations from a standard normal distribution

Sample size    S_1    S_2 (S_2')    S_3 (stat/p-value)
10             ...    ... (0.04)    0.22 (0.83)
30             ...    ... (0.48)    0.28 (0.78)
75             ...    ... (2.99)    0.69 (0.49)
100            ...    ... (3.7)     0.36 (0.72)
250            ...    ... (9.4)     1.16 (0.25)

p-values reported for S_3 statistics are two-sided.

Based on the values for the sample skewness coefficient and the L-skewness, we would conclude that the data are not symmetric for any sample size. Now, these samples are generated from a standard normal distribution (a symmetric distribution); however, the Q-Q plots (Figure 3 gives Q-Q plots for sample sizes 30 and 250) show the presence of longer tails and slight deviations toward the middle of the distribution as compared to a standard normal distribution. Since there are changes in both the tail and the middle of the distribution, they are captured by both the sample skewness coefficient and the L-skewness indices and interpreted as asymmetry. Remember that the sample skewness coefficient is very sensitive to even small changes in the extremes, whereas the L-skewness is responsive when there are overall changes to the shape of the distribution, particularly in the middle. Here is a classic case where we should not make decisions on data transformations to meet the normality assumption (as in our case, the underlying data are already normal) based on these two skewness measures. Also, there is a lack of consistency across the two skewness indices in terms of conclusions of asymmetry.
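A simulation of this kind can be sketched as follows. This is an illustrative Python version, not the authors' S-PLUS code: the function names, the use of `random.Random` and the seed are assumptions, and the output depends on the seed, so it is not expected to reproduce the values in Table 2.

```python
import random


def sample_skewness(x):
    """Traditional coefficient of skewness S1 = m3 / m2**(3/2)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def l_skewness(x):
    """Sample L-skewness S2 = l3 / l2 from the ordered sample."""
    xs = sorted(x)
    n = len(xs)
    mean = sum(xs) / n
    w2 = sum((i - 1) * v for i, v in enumerate(xs, start=1)) / (n * (n - 1))
    w3 = sum((i - 1) * (i - 2) * v for i, v in enumerate(xs, start=1)) / (
        n * (n - 1) * (n - 2)
    )
    return (6 * w3 - 6 * w2 + mean) / (2 * w2 - mean)


def simulate(sizes=(10, 30, 75, 100, 250), seed=1):
    """Draw one standard normal sample per size; return (n, S1, S2, S2') rows."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rows = []
    for n in sizes:
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        s2 = l_skewness(x)
        rows.append((n, sample_skewness(x), s2, (1 + s2) / (1 - s2)))
    return rows


if __name__ == "__main__":
    for n, s1, s2, s2_ratio in simulate():
        print(f"n={n:3d}  S1={s1:6.3f}  S2={s2:6.3f}  S2'={s2_ratio:5.2f}")
```

Rerunning with different seeds gives a feel for how much the two indices fluctuate around their symmetric values (0 for $S_1$ and $S_2$, 1 for $S_2'$) at each sample size, even though every sample comes from a symmetric distribution.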
Based on $S_1$, all of the data except the first and the last ($n = 10, 250$) show evidence of a negative skew, whereas based on $S_2$ ($S_2'$), only the first two ($n = 10, 30$) are classified as negatively skewed. Based on the symmetry test, we conclude that the data are symmetric for any sample size, which is to be expected as all of the samples are generated from a normal (symmetric) distribution.

6 Discussion

The big question is: which measure is appropriate? It probably suffices to say that this is situation dependent, particularly on how important the symmetry of the underlying

data is for the purposes of the study. The first step would still be to do a quick plotting of the data to get a sense of its distribution. As we have shown, several factors like sample size, interpretability, complexity, and accessibility play a vital role in the selection of the skewness measure. Each measure has its own share of positives and negatives. The sample skewness coefficient and the L-skewness are both readily available (although the L-skewness is not routinely produced as part of a statistical output, it can be coded easily and quickly) and computationally less intensive compared to the symmetry test. Between them, it has been shown that the L-skewness is more interpretable and less sensitive to extreme deviations in the tails. The symmetry test displays good power in detecting asymmetry against a broad class of asymmetric distribution alternatives. We therefore propose that if complexity and computational time are not constraints (mainly in the case of large sample sizes), then the symmetry test is considerably better than either of the summary skewness measures.

Figure 3. Sample Q-Q plots for the simulated normal random variables. Top: n = 30; bottom: n = 250.

References

1 Hosking JRM. L-moments: analysis and estimation of distributions using linear combinations of order statistics. Journal of the Royal Statistical Society, Series B 1990; 52:
2 Randles RH, Fligner MA, Policello GE II, Wolfe DA. An asymptotically distribution-free test for symmetry versus asymmetry. Journal of the American Statistical Association 1980; 369(75):

3 Royston P. Which measures of skewness and kurtosis are best? Statistics in Medicine 1992; 11:
4 Gupta MK. An asymptotically nonparametric test of symmetry. Annals of Mathematical Statistics 1967; 38(3):
5 Wolfe DA, Hollander M. Nonparametric Statistical Methods, 2nd edn. New York: John Wiley,
6 Davis CE, Quade D. U-statistics for skewness or symmetry. Communications in Statistics: Theory and Methods 1978; A7(5):
7 Fleming TR, Harrington DP. Counting processes and survival analysis. New York: John Wiley, 1991.