Problems With Using Microsoft Excel for Statistics




Proceedings of the Annual Meeting of the American Statistical Association, August 5-9, 2001

Jonathan D. Cryer (Jon-Cryer@uiowa.edu)
Department of Statistics and Actuarial Science
University of Iowa, Iowa City, Iowa
Joint Statistical Meetings, August 2001, Atlanta, GA

In this talk I will illustrate Excel's serious deficiencies in five areas of basic statistics:

- Graphics
- Help Screens
- Computing Algorithms
- Treatment of Missing Data
- Regression

We begin with basic graphics.

Good Graphs Should:

- Portray Numerical Information Visually Without Distortion
- Contain No Distracting Elements (e.g., no false third dimensions nor "chartjunk")
- Label Axes (Scales) and Tick Marks Appropriately
- Have a Descriptive Title and/or Caption and Legend

(References: Cleveland (1993, 1994) and Tufte (1983, 1990, 1997))

However, Excel meets virtually none of these criteria. As our first example illustrates, Excel offers false third dimensions on the vast majority of its graphs. (Unfortunately, this example is taken from the Journal of Statistics Education.)

Example: Excel Graphics With False Third Dimension (taken from JSE!)

The vast majority of chart types offered by Excel should NEVER be used! Our next example shows the graph types available as pyramid charts. None of the choices shown represents a good graph! All but the last one display false third dimensions. In addition, they all suggest stacked displays, which are known to be poor ways to make comparisons.

Example: Pyramid Charts

(For similar reasons, Excel's column, cone, and cylinder charts don't seem to have any redeeming features either!)

Scatterplots represent bread-and-butter graphs for visualizing relationships between variables.

Scatterplots Should Have:

- Good Choice of Axes
- Meaningful Legends
- No False Third Dimensions

However, Excel's default scatterplots leave much to be desired. In the following example, two data points have been covered up by the axis labels. Can you find them? And is the legend displayed to the right of the graph useful? Note that there is no label for the horizontal axis.

Example: Excel Default Scatterplot

Histograms are another basic statistical display.

Histograms Should Have:

- No Meaningless Gaps
- A Reasonable Choice of Bins
- An Easy Way To Choose Or Adjust The Bins
- A Good Aspect Ratio
- Meaningful Labels on Axes
- Appropriate Labels on Bin Tick Marks

However, the next example shows a default histogram produced by Excel. The bin labels are impossible to read, the aspect ratio is poor, and the legend and horizontal axis label are useless.

Example: Excel Default Histogram

If we click on the graph and stretch it vertically, we can then read the bin labels. The choice of class intervals or bins is rather bizarre, the number of digits displayed is atrocious, and it is not at all clear which tick marks these labels apply to.

Example: Excel Histogram (stretched vertically to read labels)

In any software, the help screens should give useful and accurate information. In particular:

Help Screens Should:

- Not Confuse
- Give Accurate Statistical Information
- Be Helpful!

However, Excel's help for statistics is quite poor. Here is an example of the Help screen displayed when you do a two-sample t-test.

Example: Excel 2000 Help Screen for the Two-Sample t-Test

t-test: Two-Sample Assuming Equal Variances analysis tool

This analysis tool performs a two-sample student's t-test. This t-test form assumes that the means of both data sets are equal; it is referred to as a homoscedastic t-test. You can use t-tests to determine whether two sample means are equal.

These sentences contain a number of basic errors. About the only value in them would be to ask your students to critique them and locate the many errors!
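One of those errors can be made concrete. The equal-variance form of the test assumes equal population variances (that is what "homoscedastic" refers to); equality of the means is the hypothesis being tested, not an assumption. The following is a minimal Python sketch of the pooled statistic under that reading. The function name `pooled_t` and the sample values are made up for illustration and do not come from the talk.

```python
import math
import statistics


def pooled_t(x, y):
    """Equal-variance (pooled) two-sample t statistic and its degrees of freedom.

    Illustrative helper: the two samples are assumed to share one
    population variance, which is why their sample variances are pooled.
    """
    n1, n2 = len(x), len(y)
    m1, m2 = statistics.mean(x), statistics.mean(y)
    v1, v2 = statistics.variance(x), statistics.variance(y)  # n-1 denominators
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df


# Hypothetical data, chosen only to exercise the function
t_stat, df = pooled_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(t_stat, df)
```

The statistic is then referred to a t distribution with `df` degrees of freedom to judge whether the population means differ.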

The next example shows the help supplied for the confidence interval function.

Example: Excel 2000 CONFIDENCE Function

CONFIDENCE

Returns the confidence interval for a population mean. The confidence interval is a range on either side of a sample mean. For example, if you order a product through the mail, you can determine, with a particular level of confidence, the earliest and latest the product will arrive. [emphasis mine]

The emphasized material is, of course, a basic misstatement of the meaning of a confidence interval.

A last example displays the help given for the standard deviation function.

Example: Excel 2000 STDEV Function

STDEV

Estimates standard deviation based on a sample. The standard deviation is a measure of how widely values are dispersed from the average value (the mean). (snip...) Remarks (snip...) The standard deviation is calculated using the "nonbiased" or "n-1" method. STDEV uses the following formula:

$$\sqrt{\frac{n \sum x^2 - \left(\sum x\right)^2}{n(n-1)}}$$

This help item introduces a new term, "nonbiased", but that is the least of the difficulties here. (And, of course, the standard deviation given here is not unbiased for the population standard deviation under any set of assumptions that I know of!) More importantly, the formula given, the so-called computing formula, is well known to be a very poor way to compute a standard deviation. We return to this below.

Excel is especially deficient in its statistical analysis when some of the data are missing.
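Before turning to missing data, the weakness of the computing formula quoted in the STDEV help can be illustrated numerically. The Python sketch below compares that formula with a two-pass calculation that subtracts the mean first. The function names and the data are illustrative assumptions, and the sketch shows only how the formula behaves in double-precision arithmetic, not what any particular Excel version does internally.

```python
import math


def stdev_computing_formula(xs):
    """Sample standard deviation via the one-pass 'computing formula'."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    # Two huge, nearly equal numbers are subtracted here, so most
    # significant digits cancel and rounding error dominates.
    numerator = n * sum_x2 - sum_x * sum_x
    if numerator < 0:
        return float("nan")  # rounding can even drive this negative
    return math.sqrt(numerator / (n * (n - 1)))


def stdev_two_pass(xs):
    """Sample standard deviation via deviations from the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))


small = [1.0, 2.0, 3.0]                # true standard deviation is 1
shifted = [x + 1e9 for x in small]     # same spread, large mean

good = stdev_two_pass(shifted)
bad = stdev_computing_formula(shifted)
print("two-pass:", good)
print("computing formula:", bad)
```

Both functions agree on the small values, but once a large constant is added the computing formula no longer returns a value near 1, even though shifting the data does not change its spread.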
Treatment of Missing Data

- Excel Does It Incorrectly
- Excel Does It Inconsistently
- Excel Makes Selecting Predictor Variables In Regression Especially Difficult When Data Are Missing

As an example, here is a simple paired dataset with some of the data missing (NA = not available or missing):

Pre   Post
1     1
NA    2
3     3
4     NA
5     5
6     6
7     7
8     8
9     9

Here is the output of the paired data analysis of these data with the Excel Data Analysis ToolPak:

t-test: Paired Two Sample for Means

                                Variable 1    Variable 2
Mean                            5.375         5.125
Variance                        7.125         8.410714286
Observations                    8             8
Pearson Correlation             1
Hypothesized Mean Difference    0
df                              7
t Stat                          -1
P(T<=t) one-tail                0.17530833
t Critical one-tail             1.89457751
P(T<=t) two-tail                0.35061666
t Critical two-tail             2.36462256

Means, variances, and df are all wrong (for paired data)! Nothing here is of much use! (But a naive user might not know or even notice!)

One of the well-documented deficiencies of Excel is its choice of computing algorithms.
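Before moving on to algorithms, it may help to see what a paired analysis of the Pre/Post data above looks like when incomplete pairs are dropped, which is the usual convention for paired data. The following is a minimal Python sketch under that assumption; the variable names and the use of `None` to stand in for NA are choices made here for illustration. Under that convention only seven complete pairs remain, so the degrees of freedom are 6 rather than 7, and every remaining difference is zero, so the t statistic is not even defined.

```python
import statistics

NA = None  # stand-in for a missing value (an assumption of this sketch)

pre = [1, NA, 3, 4, 5, 6, 7, 8, 9]
post = [1, 2, 3, NA, 5, 6, 7, 8, 9]

# Keep only the rows in which both measurements are present
pairs = [(a, b) for a, b in zip(pre, post) if a is not None and b is not None]
diffs = [b - a for a, b in pairs]

n = len(pairs)
df = n - 1
mean_pre = statistics.mean([a for a, _ in pairs])
mean_post = statistics.mean([b for _, b in pairs])
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)

# The paired t statistic divides by the spread of the differences,
# so it is undefined when all differences are identical.
t_stat = None if sd_diff == 0 else mean_diff / (sd_diff / n ** 0.5)

print("complete pairs:", n, "df:", df)
print("mean Pre:", mean_pre, "mean Post:", mean_post)
print("mean difference:", mean_diff, "sd of differences:", sd_diff)
print("t:", t_stat)
```

The means reported in the Excel output (5.375 and 5.125) are the means of the eight non-missing values in each column separately, which is why they differ from the complete-pair means computed here.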

Computing Algorithms for Basic Statistics

- Excel Uses Poor Algorithms To Find The Standard Deviation (see the Help screen for STDEV shown above)
- Excel Defines The First Quartile To Be The Ordered Observation At Position (n+3)/4
- Excel Does Not Treat Tied Observations Correctly When Ranking
- Regression Computations Are Often Erroneous Due To Poor Algorithms (see below)

In addition, Excel usually displays many more digits than appropriate. (See the histogram and paired t-test output shown above.)

Finally, Excel has major and documented difficulties with its regression procedures.

Regression in Excel

- Does Not Treat Zero-Intercept Models Correctly
- Sometimes Gets Negative Sums Of Squares
- Does Not Handle Multicollinearity Correctly
- Computes Standardized Residuals Incorrectly!
- Displays Normal Probability Plots That Are Completely Wrong!
- Makes Variable Selection Very Difficult

In summary: Due to substantial deficiencies, Excel should not be used for statistical analysis. We should discourage students and practitioners from such use. The following pretty much sums it up:

Friends Don't Let Friends Use Excel for Statistics!

Get the Right Tool for the Job!

References

Allen, I. E. (1999), "The Role of Excel for Statistical Analysis," Making Statistics More Effective in Schools of Business 14th Annual Conference Proceedings, ed. A. Rao, Wellesley: http://weatherhead.cwru.edu/msmesb/

Callaert, H. (1999), "Spreadsheets and Statistics: The Formulas and the Words," Chance, 12, 2, p. 64.

Cleveland, W. S. (1993), Visualizing Data, Hobart Press, Summit, NJ.

Cleveland, W. S. (1994), The Elements of Graphing Data, Revised Edition, Hobart Press, Summit, NJ.

Goldwater, Eva (1999), "Using Excel for Statistical Data Analysis: Successes and Cautions," Data Analysis Group, Academic Computing, University of Massachusetts, November 5, 1999, www-unix.oit.umass.edu/~evagold/excel.html

Knusel, Leo (1998), "On the Accuracy of Statistical Distributions in Microsoft Excel 97," Computational Statistics and Data Analysis, 26, pp. 375-377.

McCullough, B. D. and Wilson, B. (1999), "On the Accuracy of Statistical Procedures in Microsoft Excel 97," Computational Statistics and Data Analysis, 31, pp. 27-37.

McKenzie, Jr., J. D., and Rybolt, W. H. (1994), "What is the Most Appropriate Software for a Statistics Course?," Computer Science and Statistics: Proceedings of Twenty-Sixth Symposium on the Interface, United States of America: Interface Foundation of North America.

(1996), "Excel as a Statistical Package: Past, Present, and Future," presented at COMPSTAT '96, XII Symposium on Computational Statistics, Barcelona, Spain.

Sawitzki, Gunther (1994), "Report on the Numerical Reliability of Data Analysis Systems," Computational Statistics and Data Analysis, 18, pp. 289-301.

Simon, Gary, ASSUME (Association of Statistics Specialists Using Microsoft Excel), untitled 19 page Word file, http://www.jiscmail.ac.uk/cgi-bin/wa.exe?a2=ind0012&l=assume&d=0&p=830

Simonoff, Jeffrey (2000), "Statistical Analysis Using Microsoft Excel," Stern School of Business, New York University, www.stern.nyu.edu/~jsimonof/classes/1305/pdf/excelreg.pdf

Tufte, E. R. (1983), The Visual Display of Quantitative Information, Graphics Press, Cheshire, Conn.

Tufte, E. R. (1990), Envisioning Information, Graphics Press, Cheshire, Conn.

Tufte, E. R. (1997), Visual Explanations, Graphics Press, Cheshire, Conn.