Guide to the Summary Statistics Output in Excel

How to read the Descriptive Statistics results in Excel

[Example output table: one column per data set (PIZZA, BAKERY, SHOES, GIFTS, PETS) and one row per statistic: Mean, Standard Error, Median, Mode, Standard Deviation, Sample Variance, Kurtosis, Skewness, Range, Minimum, Maximum, Sum, Count. Mode shows #N/A for two of the data sets.]

Mean: The mean (or more specifically, the Arithmetic Mean) is a frequently used measure of central tendency that is calculated by adding up all the values in a data set and then dividing by the number of observations in the data set.

Standard Error: More specifically called the Standard Error of the Mean, this tells you how accurate your estimate of the Mean is likely to be if it is based upon a random sample of a larger population. For data with the same spread, the Mean calculated from a large sample will have a lower Standard Error than the Mean calculated from a small sample.

Median: The Median is the midpoint between the lower half of a set of values and the upper half of the set. It can provide a more useful measure of central tendency than the Arithmetic Mean if the sample being measured contains extreme outliers.

Mode: The Mode is a measure of central tendency that is calculated by determining the most common value in a data set. For example, the Mode of the set {1, 1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 9} is 5, because 5 appears more times than any other number in the set. In some cases, Excel may not be able to calculate the Mode. For example, if all of the numbers in a data set are different, Excel will return #N/A instead of a number.
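The four measures of central tendency described above can be checked outside Excel. The following is a minimal Python sketch using only the standard library; the function name `describe_center` is hypothetical, and returning `None` where Excel shows #N/A is an assumption made for illustration, not a description of Excel's internals.

```python
import math
import statistics
from collections import Counter


def describe_center(values):
    """Return the mean, standard error of the mean, median, and mode."""
    n = len(values)
    mean = statistics.mean(values)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    median = statistics.median(values)
    # Report no mode (None) when no value repeats, similar to Excel's #N/A.
    value, count = Counter(values).most_common(1)[0]
    mode = value if count > 1 else None
    return {
        "mean": mean,
        "standard_error": standard_error,
        "median": median,
        "mode": mode,
    }


# The example set used in the Mode definition above.
sample = [1, 1, 2, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8, 9]
summary = describe_center(sample)
print(summary)
```

For this set the mode is 5, matching the worked example, and a list with no repeated values gives `None` for the mode.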

Standard Deviation: Standard Deviation is a measure of dispersion in statistics, represented for a population by the Greek letter sigma, σ. It shows how widely the values in a data set are spread out around the mean. The figure Excel reports here is the sample standard deviation, which is the square root of the Sample Variance.

Sample Variance: This tells you how much a data set varies. It is the sum of the squared deviations from the mean divided by n - 1, and it is used to calculate the Standard Deviation, which is considered a more useful measure of variability because it is expressed in the same units as the data.

Kurtosis: Kurtosis is used to describe the shape of a distribution, specifically how the tails of the distribution compare to the middle part. Excel reports excess kurtosis, so a normal distribution, which is symmetrical and bell shaped, has a kurtosis of 0. A negative kurtosis indicates that there are fewer data points in the tails than in a normal distribution, which results in a somewhat flat distribution. A positive kurtosis indicates that there are more data points in the tails than in a normal distribution, usually together with a sharper peak in the center.

Skewness: Skewness is also used to describe the shape of a distribution, but specifically in regard to whether one of the tails extends out farther than the other. If the left and right halves of a distribution are perfect mirror images of each other, it has a skewness of zero and is called a symmetric distribution. A negative skewness means that the tail on the left side is longer, while a positive skewness means that the tail on the right side is longer.

Range: The Range shows the spread between the maximum and minimum values in the data set.

Maximum: The highest value in the data set.

Minimum: The lowest value in the data set.

Sum: The summation of all the values in the data set.

Count: The total number of values in the data set.
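The dispersion and shape measures above can be sketched in Python as follows. The skewness and kurtosis lines use the bias-corrected sample formulas documented for Excel's SKEW and KURT worksheet functions; that the Descriptive Statistics tool uses exactly these formulas is an assumption here, and the function name `describe_spread` is hypothetical. The sketch needs at least four values that are not all identical.

```python
import math
import statistics


def describe_spread(values):
    """Return dispersion and shape statistics for a sample of 4+ values."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance, divides by n - 1
    std_dev = math.sqrt(variance)
    # Standardized deviations from the mean.
    z = [(x - mean) / std_dev for x in values]
    # Bias-corrected sample skewness (formula documented for Excel's SKEW).
    skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    # Bias-corrected excess kurtosis (formula documented for Excel's KURT).
    kurtosis = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return {
        "standard_deviation": std_dev,
        "sample_variance": variance,
        "kurtosis": kurtosis,
        "skewness": skewness,
        "range": max(values) - min(values),
        "minimum": min(values),
        "maximum": max(values),
        "sum": sum(values),
        "count": n,
    }


spread = describe_spread([1, 2, 3, 4, 5])
print(spread)
```

The symmetric set 1 to 5 has a skewness of zero and a negative kurtosis, consistent with a flat distribution that has no long tails.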

How to read the Correlation results in Excel

[Example output table: a correlation matrix with the variables Sales ($M), Profits ($M), # Employ, Assets, profit/emp and %profit/sales as both rows and columns. The correlation of each variable with itself is 1.]

Correlation Coefficient: Also known as R or Pearson's r, the Correlation Coefficient is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and -1 inclusive, where +1 is total positive correlation, 0 is no linear correlation, and -1 is total negative correlation.

The interpretation of a Correlation Coefficient depends upon the context and purposes. For example, a Correlation of 0.8 may be considered very low if one is verifying a law of physics using high-precision instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.
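As an illustration of how each cell of the correlation matrix is obtained, here is a small Python sketch of Pearson's r for one pair of variables. The function name `pearson_r` and the sample data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # y rises with x
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # y falls as x rises
print(r_positive, r_negative)
```

The two calls show the extremes of the scale: a perfectly increasing line gives +1 and a perfectly decreasing line gives -1.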

How to read the Regression Statistics results in Excel

[Example output table: SUMMARY OUTPUT with three blocks. Regression Statistics: Multiple R, R Square, Adjusted R Square, Standard Error, Observations (27 in the example). ANOVA: rows Regression, Residual, Total; columns df, SS, MS, F, Significance F. Coefficients table: rows Intercept, SQFT, INVENTORY, ADVERTISING, FAMILIES, STORES; columns Coefficients, Standard Error, t Stat, P-value, Lower 95%, Upper 95%, Lower 95.0%, Upper 95.0%.]

Multiple R: This is the correlation coefficient between the actual and the fitted values. It tells you how strong the linear relationship is. For example, a value of 1 means a perfect linear relationship and a value of zero means no linear relationship at all.

R Square (R^2): The R Square tells you how well the regression line fits the data. For example, an R Square of 0.75 would mean that 75% of the variation of the y values around their mean can be explained by the x values. It does not mean that 75% of the points fall on the regression line.

Adjusted R Square: Adjusted R Square corrects R Square for the number of x variables in the model and should be used if there is more than one x variable.

Standard Error: The Standard Error of the regression is an estimate of the standard deviation of the error term u, that is, the typical distance between the observed y values and the regression line. This is not the same as the standard error in descriptive statistics. It is also not the same as the standard error of an individual coefficient, shown in the coefficients table, which measures the precision with which that coefficient is estimated; if a coefficient is large compared to its standard error, then the coefficient is probably different from 0.

Observations: The Observations figure tells us the size of the data set.
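The Regression Statistics block can be reproduced by hand for a small case. The sketch below is a simplification: it fits a single x variable, whereas the example output above has five, and the function name `regression_statistics` and the data are hypothetical.

```python
import math


def regression_statistics(xs, ys):
    """Regression Statistics block for a least-squares line with one x variable."""
    n = len(xs)
    k = 1  # number of x variables in this simplified sketch
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variation left unexplained by the line, and total variation of y.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    r_square = 1 - ss_residual / ss_total
    return {
        "multiple_r": math.sqrt(max(r_square, 0.0)),
        "r_square": r_square,
        # Penalizes R Square for the number of x variables.
        "adjusted_r_square": 1 - (1 - r_square) * (n - 1) / (n - k - 1),
        # Estimated standard deviation of the error term.
        "standard_error": math.sqrt(ss_residual / (n - k - 1)),
        "observations": n,
    }


stats = regression_statistics([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(stats)
```

Adjusted R Square is never larger than R Square, which is why it is the safer figure to compare across models with different numbers of x variables.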

How to read the ANOVA section of the Regression Statistics results in Excel

[Example output table: ANOVA with rows Regression, Residual, Total and columns df, SS, MS, F, Significance F, followed by the coefficients table with rows Intercept, SQFT, INVENTORY, ADVERTISING, FAMILIES, STORES.]

In the upper part of the ANOVA table, there are three rows labelled Regression, Residual and Total. The Regression row refers to the fitted values, the Residual row refers to the residuals, and the Total row refers to the overall y values.

df (Degrees of Freedom): The Degrees of Freedom number on the Regression line shows the number of predictive variables being used in the calculation. More generally, degrees of freedom are the number of values in the statistical equation that are free to vary. The Total Degrees of Freedom is n - 1, where n is the sample size. The Residual Degrees of Freedom equals the Total df minus the Regression df. In the example, with 27 observations and five predictive variables, the Total df is 26, the Regression df is 5 and the Residual df is 21.

SS (Sum of Squares): In regression, the Total Sum of Squares expresses the total variation of the y values around their mean. Comparing the Regression Sum of Squares to the Total Sum of Squares shows the proportion of the total variation that is explained by the regression model (R^2, the coefficient of determination). The larger this proportion is, the better the model explains the dependent variable as a function of the independent variables.

MS (Mean Square): The Mean Square value is an estimate of population variance and is calculated by dividing the Sum of Squares by the Degrees of Freedom. In regression, Mean Squares are used to determine whether terms in the model are significant.

F: The F statistic is the Regression Mean Square divided by the Residual Mean Square, and it is used to calculate the significance of the regression. It is analogous to the chi-square statistic in categorical data.
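To make the relationships between the df, SS, MS and F columns concrete, here is a Python sketch that builds the regression ANOVA table for a line with a single x variable. This is a simplified illustration under that assumption; the function name `regression_anova` and the data are hypothetical.

```python
def regression_anova(xs, ys):
    """ANOVA table (df, SS, MS, F) for a least-squares line with one x variable."""
    n = len(xs)
    k = 1  # number of x variables in this simplified sketch
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * x for x in xs]
    # Regression row: variation of the fitted values around the mean of y.
    ss_regression = sum((f - mean_y) ** 2 for f in fitted)
    # Residual row: variation of the observed values around the fitted values.
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    df_regression, df_residual = k, n - k - 1
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return {
        "df": (df_regression, df_residual, n - 1),
        "ss": (ss_regression, ss_residual, ss_regression + ss_residual),
        "ms": (ms_regression, ms_residual),
        "f": ms_regression / ms_residual,
        "r_square": ss_regression / (ss_regression + ss_residual),
    }


anova = regression_anova([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(anova)
```

The Regression and Residual rows add up to the Total row for both df and SS, and dividing the Regression SS by the Total SS gives R Square.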

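Significance F: The significance level for F is the P-value of the F statistic. A small value (for example, below 0.05) indicates that the regression as a whole is significant.

Coefficients: This gives you the least squares estimate of the intercept and of the coefficient for each x variable.

Standard Error: In the coefficients table, the standard error is the estimated standard deviation of each coefficient estimate. It should not be confused with the standard error of the regression in the Regression Statistics block, which is the estimated standard deviation of the error term u.

t Stat: The t Statistic for testing the null hypothesis that the coefficient is zero against the alternate hypothesis that it is not. It equals the coefficient divided by its standard error. A very large t Stat implies that the coefficient has been estimated with a fair amount of accuracy. If the absolute value of the t Stat is more than about 2, you would generally conclude that the variable in question has a significant impact on the dependent variable.

P-value: This gives you the p-value for the hypothesis test. For a coefficient to be significant at the 95% confidence level, you need a P-value below 0.05.

Lower 95%: The lower boundary for the confidence interval of the coefficient.

Upper 95%: The upper boundary for the confidence interval of the coefficient.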
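The columns of the coefficients table are linked by simple arithmetic, sketched below in Python for the slope of a line with one x variable. The function name `slope_summary` and the data are hypothetical. The standard library has no t distribution, so the critical value is passed in by the caller; 3.182 is the two-tailed 5% value for 3 degrees of freedom from a standard t table, and the P-value column is not computed here.

```python
import math


def slope_summary(xs, ys, t_critical):
    """Coefficient, standard error, t Stat and confidence bounds for the slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ms_residual = ss_residual / (n - 2)  # residual df = n - 2 with one x variable
    # Standard error of the slope estimate.
    standard_error = math.sqrt(ms_residual / sxx)
    return {
        "coefficient": slope,
        "standard_error": standard_error,
        "t_stat": slope / standard_error,
        "lower_95": slope - t_critical * standard_error,
        "upper_95": slope + t_critical * standard_error,
    }


# t_critical = 3.182: two-tailed 5% critical value for 3 degrees of freedom,
# taken from a standard t table (an input, not computed here).
row = slope_summary([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1], t_critical=3.182)
print(row)
```

When the interval between Lower 95% and Upper 95% does not contain zero, the coefficient is significant at the 5% level, which agrees with a P-value below 0.05.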

How to read the ANOVA Single Factor results in Excel

[Example output table: Anova: Single Factor. SUMMARY block: one row per group (PIZZA, BAKERY, SHOES, GIFTS, PETS) with columns Count, Sum, Average, Variance. ANOVA block: rows Between Groups, Within Groups, Total with columns SS, df, MS, F, P-value, F crit.]

Count: The number of observations in each group.

Sum: The summation of the values for each group.

Average: The Arithmetic Mean for each group.

Variance: This column shows how dispersed the values are in each group.

SS (Sum of Squares): The Sum of Squares is a measure of variation or deviation from the mean and is shown between the groups, within the groups and for the total combined. In ANOVA, the Total Sum of Squares expresses the total variation that can be attributed to the various factors.

df (Degrees of Freedom): The Degrees of Freedom shows the number of values in the statistical equation that are free to vary. Degrees of Freedom between groups is calculated by taking the number of groups and subtracting 1; in other words, k - 1, where k is the number of groups. Degrees of Freedom within groups is calculated by taking the number of observations in each group minus 1 and multiplying the result by the number of groups; in other words, k(n - 1), where k is the number of groups and n is the number of observations in each group. If the groups differ in size, this is the total number of observations minus k. Total Degrees of Freedom is the sum of the previous two figures.

MS (Mean Squares): The Mean Squares value is an estimate of population variance and is calculated by dividing the Sum of Squares by the Degrees of Freedom.

F: The F statistic is calculated by taking the Mean Square between groups and dividing it by the Mean Square within groups. This ratio allows us to determine whether there is a significant difference due to the factor. The larger this ratio is, the more the factor affects the outcome. The F statistic needs to be higher than the F Critical value in order to be able to reject the null hypothesis.

P-value: If the P-value is lower than the Alpha level (which is set at 0.05 by default in the settings for the ANOVA test), you can reject the null hypothesis.

F crit: The F Critical value is the number that the F statistic must exceed in order to be able to reject the null hypothesis.
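The single-factor ANOVA table can be worked through by hand on a small data set. The Python sketch below computes the SS, df, MS and F columns; the function name `anova_single_factor` and the three small groups are hypothetical. The P-value and F crit columns need the F distribution, which the standard library lacks, so the critical value 5.14 is taken from a standard F table for (2, 6) degrees of freedom at an Alpha of 0.05 rather than computed.

```python
def anova_single_factor(groups):
    """SS, df, MS and F for a single-factor ANOVA on a list of groups."""
    k = len(groups)
    all_values = [v for g in groups for v in g]
    total_n = len(all_values)
    grand_mean = sum(all_values) / total_n
    means = [sum(g) / len(g) for g in groups]
    # Between groups: variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within groups: variation of each value around its own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = k - 1
    df_within = total_n - k  # equals k(n - 1) when every group has n values
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_between + ss_within,
        "df_between": df_between,
        "df_within": df_within,
        "df_total": df_between + df_within,
        "ms_between": ms_between,
        "ms_within": ms_within,
        "f": ms_between / ms_within,
    }


table = anova_single_factor([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
f_critical = 5.14  # from an F table for df (2, 6) at Alpha 0.05; not computed here
reject_null = table["f"] > f_critical
print(table, reject_null)
```

Here the F statistic exceeds the F Critical value, so the null hypothesis that all group means are equal would be rejected.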

How to read the t-Test results in Excel

The t-Test is used in hypothesis testing to tell if the difference between the means of two populations is significant or is instead due to chance. This is done indirectly by testing the null hypothesis that the means of the two populations are equal. Excel provides three different options for doing a t-Test:

1. t-Test: Paired Two Sample for Means
2. t-Test: Two-Sample Assuming Equal Variances
3. t-Test: Two-Sample Assuming Unequal Variances

The t-Test: Paired Two Sample for Means is used when the sample observations are naturally paired. The usual reason for choosing this option is that you are testing the same group twice. For example, if you are testing a new drug, you will want to compare the sample before and after the research participants take the drug to see if the results are different. This particular test in Excel uses a paired two-sample t-Test to determine if the before and after observations are likely to have been derived from distributions with equal population means.

The other two t-Test options are used when different groups are being examined (i.e. you are not testing one group twice over time). The t-Test: Two-Sample Assuming Equal Variances is used when you know (either from the question or because you have analyzed the variance in the data) that the variances are the same. The t-Test: Two-Sample Assuming Unequal Variances is used when either you know the variances are not the same, or you do not know whether the variances are the same or not. In most cases, you don't know if the variances are equal or not, so you would use the Two-Sample Assuming Unequal Variances test.

The Excel t-Test output reports information on both the one-sided and two-sided tests in the same output. For a one-sided two-sample t-Test, only the two lines of output labeled one-tail immediately below the line containing the value of the t Stat should be examined. For a two-sided two-sample t-Test, use the next two lines of output.
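For the paired option, the test works on the differences within each pair. The following Python sketch shows how the t Stat and degrees of freedom could be computed for paired observations; the function name `paired_t_stat` and the before/after figures are hypothetical, and no P-value is computed because the standard library has no t distribution.

```python
import math
import statistics


def paired_t_stat(sample_1, sample_2, hypothesized_difference=0.0):
    """t Stat and degrees of freedom for naturally paired observations."""
    # Work on the difference within each pair, not on the two samples separately.
    differences = [a - b for a, b in zip(sample_1, sample_2)]
    n = len(differences)
    mean_difference = statistics.mean(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    t_stat = (mean_difference - hypothesized_difference) / standard_error
    return t_stat, n - 1


# Hypothetical measurements on the same four participants.
before = [10, 12, 9, 11]
after = [12, 14, 10, 13]
t_stat, df = paired_t_stat(after, before)
print(t_stat, df)
```

Because each participant serves as their own control, the degrees of freedom depend on the number of pairs, not on the total number of measurements.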

t-Test: Two-Sample Assuming Equal Variances

[Example output table: one column for each data set (Air, Helium) and rows Mean, Variance, Observations, Pooled Variance, Hypothesized Mean Difference (0 in the example), df (76 in the example), t Stat, P(T<=t) one-tail, t Critical one-tail, P(T<=t) two-tail, t Critical two-tail.]

Mean: This line shows us the Mean for each of the two data sets.

Variance: This line shows how dispersed the values are in each data set. It can help confirm whether to use the t-Test option assuming equal variances or the one assuming unequal variances.

Observations: The Observations line tells us the size of each data set.

Pooled Variance: This value is used to estimate the variance of separate populations when the mean of each population may be different, but it is safe to assume that the variance of each population is the same. Under the assumption that the population variances are equal, the pooled sample variance provides a higher-precision estimate of variance than the individual sample variances. This higher level of precision can provide increased statistical sensitivity when used in statistical tests that compare the populations, like the t-Test.

Hypothesized Mean Difference: This value is an assumption which depends upon our original hypothesis. If our hypothesis is that there is no difference between the means of the two data sets, this should be set at 0 when setting up the t-Test in Excel. If our hypothesis is that there is a difference between the two means of at least x, then the Hypothesized Mean Difference should be set at x.

df (Degrees of Freedom): The Degrees of Freedom shows the number of values in the statistical equation that are free to vary. It is equal to the sum of the Observations for both data sets minus 2 (n1 + n2 - 2).
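The Pooled Variance and df lines follow directly from the Variance and Observations lines. A minimal Python sketch, with a hypothetical function name `pooled_variance` and made-up data:

```python
import statistics


def pooled_variance(sample_1, sample_2):
    """Pooled sample variance and degrees of freedom for two samples."""
    n1, n2 = len(sample_1), len(sample_2)
    var_1 = statistics.variance(sample_1)  # sample variance, divides by n - 1
    var_2 = statistics.variance(sample_2)
    df = n1 + n2 - 2
    # Weighted average of the two variances, weighted by each sample's df.
    pooled = ((n1 - 1) * var_1 + (n2 - 1) * var_2) / df
    return pooled, df


pooled, df = pooled_variance([1, 2, 3], [2, 4, 6, 8])
print(pooled, df)
```

The larger sample contributes more weight to the pooled figure, so the result always lies between the two individual sample variances.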

t Stat: If the absolute value of the t Statistic is less than the t Critical value, we cannot reject the null hypothesis. If it is greater than the t Critical value, we can reject the null hypothesis.

P(T<=t) one-tail: In a one-tail test, this is the p-value, which represents the probability that a value of t at least as extreme as the calculated value could have occurred by chance if the null hypothesis is true. The p-value is the fractional area of one tail of the t distribution beyond the calculated value of the t Stat.

t Critical one-tail: In a one-tail test, the t Stat is compared to this value in order to determine whether or not to reject the null hypothesis. If the absolute value of the t Stat is greater than the t Critical value, and the difference is in the hypothesized direction, that indicates that the means of the two data sets are significantly different.

P(T<=t) two-tail: In a two-tail test, this is the p-value, which represents the probability that a value of t at least as far from zero, in either direction, as the calculated t Stat could have occurred by chance if there were no difference in the means.

t Critical two-tail: In a two-tail test, the absolute value of the t Stat is compared to this value in order to determine whether or not to reject the null hypothesis. If the absolute value of the t Stat is greater than the t Critical value, that indicates that the means of the two data sets are significantly different.
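The comparison of the t Stat with the t Critical value can be sketched in Python for the equal-variances test. The function name `equal_variance_t_stat` and the data are hypothetical. The critical value is supplied by the caller; 2.571 is the two-tail 5% value for 5 degrees of freedom from a standard t table, and the P-value lines are not computed because the standard library has no t distribution.

```python
import math
import statistics


def equal_variance_t_stat(sample_1, sample_2, hypothesized_difference=0.0):
    """t Stat and df for a two-sample t-Test assuming equal variances."""
    n1, n2 = len(sample_1), len(sample_2)
    df = n1 + n2 - 2
    pooled = (
        (n1 - 1) * statistics.variance(sample_1)
        + (n2 - 1) * statistics.variance(sample_2)
    ) / df
    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(pooled * (1 / n1 + 1 / n2))
    difference = statistics.mean(sample_1) - statistics.mean(sample_2)
    return (difference - hypothesized_difference) / standard_error, df


t_stat, df = equal_variance_t_stat([1, 2, 3], [2, 4, 6, 8])
t_critical_two_tail = 2.571  # from a t table for df = 5 at Alpha 0.05; an input
# Two-tail decision: compare the absolute value of the t Stat.
reject_null = abs(t_stat) > t_critical_two_tail
print(t_stat, df, reject_null)
```

In this made-up example the absolute value of the t Stat is below the two-tail critical value, so the null hypothesis of equal means cannot be rejected.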