BIOL Assessing normality

Assessing normality

Not all continuous random variables are normally distributed. It is important to evaluate how well a data set is approximated by a normal distribution. This section presents some statistical tools for checking whether a given set of data is normally distributed.

1. Previous knowledge of the nature of the distribution

Problem: A researcher working with sea stars needs to know whether sea star size (length of radii) is normally distributed. What do we know about the size distributions of sea star populations?

  1. Has previous work with this species of sea star shown size to be normally distributed?
  2. Has previous work with a closely related species of sea star shown size to be normally distributed?
  3. Has previous work with sea stars in general shown size to be normally distributed?

If you can answer yes to any of the above questions, and you have no reason to think your population should be different, you could reasonably assume that your population is also normally distributed and stop here. However, if any previous work has shown a non-normal distribution of sea star sizes, you had better use other techniques.

2. Construct charts

For small or moderate-sized data sets, construct a stem-and-leaf display and a box-and-whisker plot; if the data are approximately normal, both will look roughly symmetric. For large data sets, construct a histogram or polygon and see whether the distribution is bell-shaped or deviates grossly from a bell-shaped normal distribution. Look for skewness and asymmetry. Look for gaps in the distribution, that is, intervals with no observations.

However, remember that normality requires more than just symmetry; the fact that the histogram is symmetric does not mean that the data come from a normal distribution. Also, data sampled from a normal distribution will sometimes look distinctly different from the parent distribution. So we need techniques that allow us to determine whether data are significantly different from a normal distribution.

3. Normal Counts method

Count the number of observations within 1, 2, and 3 standard deviations of the mean and compare the results with what is expected for a normal distribution under the 68-95-99.7 rule. According to the rule:

  68% of the observations lie within one standard deviation of the mean;
  95% of the observations lie within two standard deviations of the mean;
  99.7% of the observations lie within three standard deviations of the mean.

Example: As part of a demonstration one semester, I collected data on the heights of a sample of 25 biostatistics students. These data are presented in the table below. Does the sample appear to have been drawn from a normally distributed population?

Table. Heights, in inches, of 25 biostatistics students
Columns: Heights, in inches; Frequency. (Cell values missing; Total = 24.)

Solution: For the normal counts method, determine the following:

x̄ = 70.6; s = 2.3

x̄ ± s is 68.3 to 72.9. 17 out of the 24 observations, i.e. 17/24 ≈ 0.71 = 71%, fall within x̄ ± s (between 68.3 and 72.9), which is approximately equal to 68%. There is no reason to doubt that the sample is drawn from a normal population.

4. Compute descriptive summary measures

For data from a normal distribution:

  a. The mean, median and mode will have similar values.
  b. The interquartile range will be approximately equal to 1.33 s.
  c. The range will be approximately equal to 6 s.

5. Evaluate normal probability plot

If the data come from a normal or approximately normal distribution, the plotted points will fall approximately along a straight line (a 45 degree line). However, if your sample departs from normality, the points on the graph will deviate from that line. If the points trail off from the straight line in a curve at the top end, with observed values bigger than expected, the distribution is right skewed. If the observed values trail off at the bottom end, the distribution is left skewed. Note that any worthwhile statistics package will construct these graphs for you (see below).
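The normal counts method and the summary-measure checks described above can be sketched in a few lines of Python. This is only an illustration: the function names `normal_counts` and `summary_ratios` are hypothetical, and the `heights` list is made-up data, not the class data from the table (which is not available here). Quartiles are taken with Python's default method, which may differ slightly from other packages.

```python
import statistics


def normal_counts(data):
    """Fraction of observations within 1, 2 and 3 standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    n = len(data)
    return {k: sum(1 for x in data if abs(x - mean) <= k * s) / n for k in (1, 2, 3)}


def summary_ratios(data):
    """Summary measures to compare with the rules of thumb (IQR ~ 1.33 s, range ~ 6 s)."""
    s = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)  # three cut points: Q1, median, Q3
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "iqr_over_s": (q3 - q1) / s,
        "range_over_s": (max(data) - min(data)) / s,
    }


# Made-up heights in inches, for illustration only (NOT the students' data).
heights = [66, 67, 68, 68, 69, 69, 70, 70, 70, 71,
           71, 71, 71, 72, 72, 72, 73, 73, 74, 75]

print(normal_counts(heights))   # compare with 0.68, 0.95, 0.997
print(summary_ratios(heights))  # compare with 1.33 and 6
```

With a small sample the range is usually well below 6 s, so that rule of thumb is best read as a rough guide for large data sets.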

6. Measures of Skewness and Kurtosis

Skewness: The normal distribution is symmetrical. Asymmetrical distributions are sometimes called skewed. Skewness is calculated as follows:

$$\text{skewness} = \frac{n \sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3 (n-1)(n-2)}$$

where $\bar{x}$ is the mean, $s$ is the standard deviation, and $n$ is the number of data points.

A perfectly normal distribution will have a skewness statistic of zero. If this statistic departs significantly from 0, then we lose confidence that our sample comes from a normally distributed population. If it is negative, the distribution is skewed to the left (negatively skewed). If it is positive, the distribution is skewed to the right (positively skewed).

[Figure panels: Negatively skewed distribution (skewed to the left), skewness < 0; Normal distribution (symmetrical), skewness = 0; Positively skewed distribution (skewed to the right), skewness > 0]

Kurtosis: A bell curve will also depart from normality if the tails fail to fall off at the proper rate. If the tails are heavier than those of a normal curve, the distribution tends to look too peaked in the middle and too fat in the tails. If the tails are lighter than those of a normal curve, the distribution tends to look too flat. A statistic commonly used to measure kurtosis is calculated using the formula

$$\text{kurtosis} = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$

where $\bar{x}$ is the mean, $s$ is the standard deviation, and $n$ is the number of data points.

With this formula, a perfectly normal distribution will also have a kurtosis statistic of zero. If kurtosis is significantly less than zero, the distribution is flat and is said to be platykurtic. If kurtosis is significantly greater than zero, the distribution is pointed or peaked and is called leptokurtic.

[Figure panels: Platykurtic distribution (low degree of peakedness), kurtosis < 0; Normal (mesokurtic) distribution, kurtosis = 0; Leptokurtic distribution (high degree of peakedness), kurtosis > 0]
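As a minimal sketch, the two sample statistics above can be computed directly. The function names `sample_skewness` and `sample_kurtosis` are hypothetical, and the code assumes the bias-adjusted forms, with the sample standard deviation (n - 1 denominator) and a kurtosis centred on zero for a normal distribution.

```python
import statistics


def sample_skewness(data):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s)^3)."""
    n = len(data)
    if n < 3:
        raise ValueError("skewness needs at least 3 observations")
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


def sample_kurtosis(data):
    """Adjusted sample excess kurtosis (zero for a normal distribution)."""
    n = len(data)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 observations")
    mean = statistics.fmean(data)
    s = statistics.stdev(data)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(((x - mean) / s) ** 4 for x in data) - correction


symmetric = [1, 2, 3, 4, 5]      # symmetric, so skewness is zero
long_right = [1, 2, 3, 4, 10]    # one large value pulls the right tail out
print(sample_skewness(symmetric), sample_kurtosis(symmetric))
print(sample_skewness(long_right))
```

Working in standardized deviations, (x - mean) / s, keeps the third and fourth powers from growing very large, which is one way to limit the rounding problems of a hand calculation.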

You won't have to calculate these statistics by hand. The calculation itself is sensitive to rounding errors because the deviations are raised to the third and fourth powers.

Using SPSS to Evaluate Data for Normality

Before the advent of good computers and statistical programs, users could be forgiven for trying to avoid any surplus calculations. Now that both are available and much easier to use, tests for normality should always be carried out as a matter of best practice. SPSS offers a variety of methods for evaluating normality.

Normal probability plot (P-P plot)

The P-P plot graphs the expected cumulative probability against the observed cumulative probability.

1. Open the SPSS file containing your data.
2. From the main menu, select Graph and then P-P. From the list of available variables, move the variables you wish to analyze to the variable window. If you select multiple variables, SPSS will create a separate plot for each.
3. In the box for Test Distribution, be sure that the pop-up menu is set to a Normal distribution. In addition, be sure that the Estimate from data box is checked.
4. In the box for Proportion Estimation Formula, select the radio button for the Rankit method.
5. Finally, in the Ranks Assigned to Ties box, select the radio button for High.
6. Click OK to obtain the plot and complete your analysis.

[Figure: Normal P-P Plot of Student's heights (inches); axes: Observed Cum Prob, Expected Cum Prob]

Q-Q plot

Repeat the above steps, but this time select Q-Q (main menu, Graph and then Q-Q).

[Figure: Normal Q-Q Plot of Student's heights (inches); axes: Observed Value, Expected Normal]

Limitation of visual methods

One limitation of any visual approach to evaluating normality is that your conclusion is open to some uncertainty. How, for example, can you put a quantitative statement on the confidence of your conclusion? How linear is linear, and how much deviation from linearity is acceptable? A more quantitative way to determine whether a data set is normally distributed is to use the Kolmogorov-Smirnov test or the Shapiro-Wilk test.

Using SPSS to run tests of normality

1. Open the SPSS file containing your data and from the main menu select Analyze.
2. Select Descriptive Statistics.
3. Select Explore.
4. Move your variable from the Variable List window to the Dependent List window.
5. Under Display, click Both.
6. Click Plots.
7. Under Boxplots, check Factor levels together.
8. Under Descriptive, check Histogram and Stem-and-leaf.
9. Check Normality plots with tests.
10. Click Continue.
11. Click OK.
12. Evaluate the plots for evidence of normality.
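For readers without SPSS, the points of a normal Q-Q plot can be computed directly. The sketch below assumes the Rankit proportion, (rank - 1/2) / n, and estimates the mean and standard deviation from the data; it gives every sorted value its own rank, so it does not reproduce the High option for ties. The function names `normal_qq_points` and `straightness` are hypothetical, and using a correlation coefficient as a rough answer to "how linear is linear" is an added suggestion, not part of the original procedure.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(data):
    """Return (observed, expected) pairs for a normal Q-Q plot using Rankit proportions."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # normal curve estimated from the data
    return [(x, fitted.inv_cdf((rank - 0.5) / n)) for rank, x in enumerate(xs, start=1)]


def straightness(points):
    """Pearson correlation between observed and expected values (1.0 = perfectly straight)."""
    obs = [p[0] for p in points]
    exp = [p[1] for p in points]
    mo, me = mean(obs), mean(exp)
    sxy = sum((o - mo) * (e - me) for o, e in zip(obs, exp))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((e - me) ** 2 for e in exp)
    return sxy / (sxx * syy) ** 0.5


# Made-up heights in inches, for illustration only.
heights = [66, 67, 68, 68, 69, 69, 70, 70, 70, 71,
           71, 71, 71, 72, 72, 72, 73, 73, 74, 75]
points = normal_qq_points(heights)
print(points[:3])
print(round(straightness(points), 3))
```

Plotting the pairs with any charting tool gives the Q-Q plot; points close to the line y = x indicate approximate normality.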

Descriptives: Student's heights (inches)
Columns: Statistic, Std. Error. (Cell values missing.)
Rows: Mean; 95% Confidence Interval for Mean (Lower Bound, Upper Bound); 5% Trimmed Mean; Median; Variance; Std. Deviation; Minimum; Maximum; Range; Interquartile Range; Skewness; Kurtosis

Assessment of skewness and kurtosis

In practice (as with all estimates), we are unlikely ever to see values of exactly zero for either the skewness or the kurtosis statistic. The real question is whether the given estimates differ significantly from zero. To answer this we need the standard error of skewness and, similarly, the standard error of kurtosis. In SPSS, the Explore command provides the skewness and kurtosis statistics together with their standard errors. The key question is whether the value of zero lies within the 95% confidence interval.

For assessing skewness: [values missing]
For assessing kurtosis: [values missing]

Thus the 95% confidence interval for the skewness score ranges from [value missing] to [value missing], and the 95% confidence interval for the kurtosis score ranges from [value missing] to [value missing]. In both cases zero is within the bounds, so neither statistic is significantly different from zero. The data are therefore consistent with a normal distribution.

Kolmogorov-Smirnov / Shapiro-Wilk tests for normality

The first is the Kolmogorov-Smirnov test for normality, sometimes termed the KS Lilliefors test. The second is the Shapiro-Wilk test. The advice from SPSS is to use the latter (the Shapiro-Wilk test) when sample sizes are small (n < 50). The output, using the data from above, is presented below.
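The confidence-interval check for skewness and kurtosis described above can be sketched as follows. Two assumptions are made here because the worked values are not available: the standard errors use the usual formulas that depend only on the sample size n, and the interval is taken as statistic ± 1.96 standard errors. The function names are hypothetical, and the example statistics passed in at the end are made up.

```python
import math


def se_skewness(n):
    """Standard error of the adjusted sample skewness for a sample of size n."""
    return math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of the adjusted sample kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def interval(statistic, se, z=1.96):
    """Approximate 95% confidence interval: statistic +/- z standard errors (z is an assumption)."""
    return (statistic - z * se, statistic + z * se)


def zero_in_interval(statistic, se, z=1.96):
    """True when zero lies inside the interval, i.e. no significant departure from zero."""
    low, high = interval(statistic, se, z)
    return low <= 0 <= high


n = 24  # sample size in the heights example
print(round(se_skewness(n), 3), round(se_kurtosis(n), 3))
# Made-up statistics, for illustration only:
print(zero_in_interval(0.30, se_skewness(n)))
print(zero_in_interval(-0.50, se_kurtosis(n)))
```

Because the standard errors shrink as n grows, the same skewness value can be judged non-significant in a small sample and significant in a large one.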

Tests of Normality: Student's heights (inches)
Columns: Kolmogorov-Smirnov(a): Statistic, df, Sig.; Shapiro-Wilk: Statistic, df, Sig. (Cell values missing.)
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

The value listed as Sig. (labelled Asymp. Sig. in some SPSS output) is a probability and lies between 0 and 1. In general, a Sig. value of 0.05 or less is considered good evidence that the data set is not normally distributed. A value greater than 0.05 implies that there is insufficient evidence to suggest that the data set is not normally distributed. In our example, the significance of [value missing] accordingly means that our distribution is not significantly different from a normal distribution.

Boxplots

It is very hard to detect normality using a box plot. But at the very least, look for symmetry and the presence of outliers. Severe skewness and/or outliers are indications of non-normality.

Normal distribution: If there are only a few outliers and the median line evenly divides the box, then the sample may well come from a normal or near-normal distribution.

Skewness: If there are numerous outliers to one side or the other of the box, or the median line does not evenly divide the box, then the population distribution from which the data were sampled may be skewed.

Skewness to the right: If the boxplot shows outliers at the upper range of the data (above the box), the median line does not evenly divide the box, and the upper tail of the boxplot is longer than the lower tail, then the population distribution from which the data were sampled may be skewed to the right.

[Figure: hypothetical boxplot for data sampled from a distribution that is skewed to the right]

Skewness to the left: If the boxplot shows outliers at the lower range of the data (below the box), the median line does not evenly divide the box, and the lower tail of the boxplot is longer than the upper tail, then the population distribution from which the data were sampled may be skewed to the left.

[Figure: hypothetical boxplot for data sampled from a distribution that is skewed to the left]

[Figure panels: Negatively Skewed; Symmetric (Not Skewed); Positively Skewed]

From the boxplot of student heights in our example, we see that the distribution appears to be reasonably symmetric and approximately normal.

[Figure: boxplot of Student's heights (inches), N = 24]
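Returning to the Kolmogorov-Smirnov test described earlier, its test statistic is simply the largest gap between the empirical cumulative distribution of the sample and the fitted normal cumulative distribution. The sketch below computes only that statistic D, with the mean and standard deviation estimated from the data as in the Lilliefors version; it does not compute a significance value, which requires Lilliefors tables or a statistics package. The function name `ks_statistic_normal` and both demonstration samples are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_statistic_normal(data):
    """Largest distance D between the empirical CDF and a normal CDF fitted to the data."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


# Two made-up samples: one built from normal quantiles, one strongly right skewed.
near_normal = [NormalDist().inv_cdf((i - 0.5) / 50) for i in range(1, 51)]
skewed = [math.exp(2 * z) for z in near_normal]

print(round(ks_statistic_normal(near_normal), 3))
print(round(ks_statistic_normal(skewed), 3))
```

A small D means the sample follows the fitted normal curve closely; a large D is evidence against normality, and the Sig. value reported by SPSS quantifies how large is too large for a given sample size.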

Normalizing Transformations

Many statistical tests (parametric tests) are based on the assumption that the data are normally distributed. However, if we actually plot the data from a study, we rarely see perfectly normal distributions. Most often, the data will be skewed to some degree or show some deviation from mesokurtosis. Two questions immediately arise: A) Can we analyze these data with parametric tests? and, if not, B) Is there something we can do to the data to make them more normal?

What to do if Not Normal?

According to some researchers, violations of normality are sometimes not problematic for running parametric tests. When a variable is not normally distributed (a distributional requirement for many different analyses), we can create a transformed variable and test it for normality. If the transformed variable is normally distributed, we can substitute it in our analysis.

Data transformation

Data transformation involves performing a mathematical operation on each of the scores in a set of data, thereby converting the data into a new set of scores, which are then used to analyze the results of an experiment.

To solve for Positive Skew

Square root, logarithmic, and inverse (1/X) transforms "pull in" the right side of the distribution toward the middle and normalize right (positive) skew. Inverse transforms are stronger than logarithmic ones, which are stronger than roots.

Square root transformation

The square-root transformation can be effective in normalizing distributions that have a slight to moderate positive skew. Data taken from a Poisson distribution are sometimes effectively normalized with a square-root transformation. The square-root transformation is obtained through use of the equation $Y = \sqrt{X}$, where X is the original score (observation) and Y represents the transformed score. Cube roots, fourth roots, etc., will be increasingly effective for data that are increasingly skewed.

When you use the square root transformation, be careful not to have any zeros or negative numbers among your raw data. If, for example, there are any zero values, add a constant C, where C is some small positive value such as 0.5, and replace each observation by $\sqrt{X + C}$. If there are negative numbers as well as positive numbers, add a constant to each number to make all values greater than 0. Although this transformation is not used as frequently in medicine as the log transformation, it can be very useful when a log transformation overcorrects.

Logarithmic transformation

A logarithmic transformation may be useful in normalizing distributions that have a more severe positive skew than a square-root transformation can handle. Such a distribution is termed lognormal because it can be made normal by log transforming the values.

When log transforming data, we can choose to take logs either to base 10 (the 'common' log) or to base e (the 'natural' log, abbreviated ln). The log transformation is similar to the square root transformation in that zeros and negative numbers are not allowed. Use the same technique to eliminate them. Some people use the smallest possible value for their variable as a constant; others use an arbitrarily small number, such as [value missing] or, most commonly, 1. The back-transformation of a log is called the antilog; the antilog of the natural log is the exponential, e (e ≈ 2.718). In medicine, the log transformation is frequently used because many variables have right-skewed distributions.

Inverse or reciprocal transformation

A reciprocal transformation exerts the most extreme adjustment with regard to normality. It is used to normalize severely skewed data. Accordingly, the reciprocal transformation is often able to normalize data that the square-root and logarithmic transformations are unable to normalize. The reciprocal transformation is obtained through use of the equation $Y = 1/X$. If any of the scores are equal to zero, the equation $Y = 1/(X + 1)$ should be employed. When inverted, large numbers become small, and small numbers become large.

It's possible that you chose a transformation that overcorrected and turned a moderate right skew into a moderate left one. This gains you nothing except heartache. So, if this has happened, go back and try a less "powerful" transformation: perhaps square root rather than log, or log rather than reciprocal.

To solve for Negative Skew

If skewness is negative, "flip" the curve over so that the left-skewed curve becomes right-skewed, allowing us to use the transformation procedures for positively skewed distributions. Flipping the curve over requires reflection of the variable before transforming.

Reflection simply involves the following: before the data are transformed, find the maximum value (9 in the example below), add 1 to it (to avoid too many zeros when we're done), and subtract each raw value from this number. For example, if we started out with the numbers [values missing], then we would subtract each number from 10 (the maximum, 9, plus 1), yielding [values missing]. We would then use the transformations for right-skewed data, rather than left-skewed data.

More transformations

Power transformation $Y = X^p$: The most common type of transformation useful for biological data is the power transformation, which transforms X to $X^p$, where p is a power greater than zero. Values of p less than 1 correct right skew, which is the common situation (a power of 2/3 is a common choice when attempting to normalize). Values of p greater than 1 correct left skew. The square transformation (p = 2), for example, achieves the reverse of the square-root transformation. If X is skewed to the left (negatively skewed), the distribution of $Y = X^p$ with p > 1 is often approximately normal. For right skew, decreasing p decreases right skew. Too great a reduction of p will overcorrect and cause left skew.
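The transformations described in this section can be sketched as small Python helpers, together with a skewness check before and after transforming. All function names are hypothetical, the constant `c` is the value added to remove zeros or negatives, and the `right_skewed` sample is made up for illustration.

```python
import math
import statistics
from statistics import NormalDist


def skewness(data):
    """Adjusted sample skewness, used here to re-check the data after transforming."""
    n = len(data)
    mean = statistics.fmean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)


def sqrt_transform(data, c=0.0):
    """Y = sqrt(X + c); use a small c such as 0.5 when the data contain zeros."""
    return [math.sqrt(x + c) for x in data]


def log_transform(data, c=0.0):
    """Y = ln(X + c); c = 1 is a common choice when the data contain zeros."""
    return [math.log(x + c) for x in data]


def reciprocal_transform(data, c=0.0):
    """Y = 1 / (X + c); use c = 1 when any score equals zero."""
    return [1.0 / (x + c) for x in data]


def power_transform(data, p):
    """Y = X ** p; p < 1 reduces right skew, p > 1 reduces left skew."""
    return [x ** p for x in data]


def reflect(data):
    """Subtract each value from (maximum + 1) so that left skew becomes right skew."""
    top = max(data) + 1
    return [top - x for x in data]


# Made-up right-skewed sample (exponentials of evenly spaced normal quantiles).
z = [NormalDist().inv_cdf((i - 0.5) / 40) for i in range(1, 41)]
right_skewed = [math.exp(v) for v in z]

for name, values in [("raw", right_skewed),
                     ("square root", sqrt_transform(right_skewed)),
                     ("log", log_transform(right_skewed))]:
    print(name, round(skewness(values), 3))
```

For left-skewed data, apply `reflect` first and then one of the right-skew transformations. Keeping the transformed values in a new list leaves the raw data untouched, so a transformation that overcorrects is easy to undo.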

The arcsine (arcsin) transformation: The arcsine of a number is the angle whose sine is that number. The arcsine transformation (also referred to as an angular or inverse sine transformation) is used to normalize data that are proportions between 0 and 1 or percentages between 0% and 100%. The arcsine transformation, which expresses the value of Y in degrees, is obtained through use of the equation $Y = \arcsin\sqrt{X}$, where X is a proportion between 0 and 1. This means that the arcsine transformation requires two steps. First, obtain the square root of X. Second, using a calculator, find the angle whose sine equals this value.

To get the arcsine value for a percentage (e.g. 50%), divide it by 100 (= 0.5), take the square root (= 0.7071), then press the "sin-1" key on your calculator to get the arcsine value (= 45). To get the arcsine value for a proportion (e.g. 0.4), take the square root (= 0.6325), then press "sin-1" to get the arcsine value (= 39.23). In Excel, after obtaining the square root of X (Fig. 1), apply the ASIN function (Fig. 2) and then multiply by 57.3 to convert the result from radians to degrees. The transformed value, in degrees, ranges from 0 (for a proportion of zero) to 90 (for a proportion of 1).

[Figures: Fig. 1, Fig. 2]

After any transformation, you must re-check your data to ensure that the transformation improved the distribution of the data (or at least didn't make it any worse!). Sometimes log or square root transformations can skew data just as severely in the opposite direction. If transformation does not bring the data to a normal distribution, the investigators might well choose a nonparametric procedure that does not make any assumptions about the shape of the distribution.

SPSS/PC

You can use the COMPUTE command to transform the data. Select Transform - Compute - Target Variable (input a new variable name) - Numeric Expression (input transform formula).

It's usually a good idea to make up a new variable to hold the transformed data, rather than over-writing the original values; in this way, you can easily undo your mistakes.
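The two-step arcsine transformation described above (square root, then inverse sine, expressed in degrees) can be sketched in Python. The function names `arcsine_degrees` and `arcsine_from_percent` are hypothetical; `math.degrees` performs the radians-to-degrees conversion that the factor 57.3 approximates.

```python
import math


def arcsine_degrees(proportion):
    """Arcsine transformation of a proportion in [0, 1], returned in degrees."""
    if not 0 <= proportion <= 1:
        raise ValueError("proportion must lie between 0 and 1")
    root = math.sqrt(proportion)          # step 1: square root of X
    return math.degrees(math.asin(root))  # step 2: angle whose sine is that value


def arcsine_from_percent(percent):
    """Same transformation for a percentage between 0 and 100."""
    return arcsine_degrees(percent / 100)


print(round(arcsine_from_percent(50), 2))
print(round(arcsine_degrees(0.4), 2))
```

Storing the result in a new variable, as advised above for SPSS, applies equally here: keep the original proportions alongside the transformed values.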


More information

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER seven Statistical Analysis with Excel CHAPTER chapter OVERVIEW 7.1 Introduction 7.2 Understanding Data 7.3 Relationships in Data 7.4 Distributions 7.5 Summary 7.6 Exercises 147 148 CHAPTER 7 Statistical

More information

Diagrams and Graphs of Statistical Data

Diagrams and Graphs of Statistical Data Diagrams and Graphs of Statistical Data One of the most effective and interesting alternative way in which a statistical data may be presented is through diagrams and graphs. There are several ways in

More information

Lecture Notes Module 1

Lecture Notes Module 1 Lecture Notes Module 1 Study Populations A study population is a clearly defined collection of people, animals, plants, or objects. In psychological research, a study population usually consists of a specific

More information

Exploratory Data Analysis

Exploratory Data Analysis Exploratory Data Analysis Johannes Schauer Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz February 12, 2008 Introduction

More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties: Density Curve A density curve is the graph of a continuous probability distribution. It must satisfy the following properties: 1. The total area under the curve must equal 1. 2. Every point on the curve

More information


EPS 625 INTERMEDIATE STATISTICS FRIEDMAN TEST EPS 625 INTERMEDIATE STATISTICS The Friedman test is an extension of the Wilcoxon test. The Wilcoxon test can be applied to repeated-measures data if participants are assessed on two occasions or conditions

More information


DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 1. Descriptive Statistics Statistics

More information

Describing, Exploring, and Comparing Data

Describing, Exploring, and Comparing Data 24 Chapter 2. Describing, Exploring, and Comparing Data Chapter 2. Describing, Exploring, and Comparing Data There are many tools used in Statistics to visualize, summarize, and describe data. This chapter

More information


MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

Module 4: Data Exploration

Module 4: Data Exploration Module 4: Data Exploration Now that you have your data downloaded from the Streams Project database, the detective work can begin! Before computing any advanced statistics, we will first use descriptive

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author.

More information

Introduction to Quantitative Methods

Introduction to Quantitative Methods Introduction to Quantitative Methods October 15, 2009 Contents 1 Definition of Key Terms 2 2 Descriptive Statistics 3 2.1 Frequency Tables......................... 4 2.2 Measures of Central Tendencies.................

More information

First Midterm Exam (MATH1070 Spring 2012)

First Midterm Exam (MATH1070 Spring 2012) First Midterm Exam (MATH1070 Spring 2012) Instructions: This is a one hour exam. You can use a notecard. Calculators are allowed, but other electronics are prohibited. 1. [40pts] Multiple Choice Problems

More information

Exploring Data: The Beast of Bias

Exploring Data: The Beast of Bias Sources of Bias Exploring Data: The Beast of Bias A bit of revision. We ve seen that having collected data we usually fit a model that represents the hypothesis that we want to test. This model is usually

More information

Sta 309 (Statistics And Probability for Engineers)

Sta 309 (Statistics And Probability for Engineers) Instructor: Prof. Mike Nasab Sta 309 (Statistics And Probability for Engineers) Chapter 2 Organizing and Summarizing Data Raw Data: When data are collected in original form, they are called raw data. The

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

Measurement with Ratios

Measurement with Ratios Grade 6 Mathematics, Quarter 2, Unit 2.1 Measurement with Ratios Overview Number of instructional days: 15 (1 day = 45 minutes) Content to be learned Use ratio reasoning to solve real-world and mathematical

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance 2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

More information

Def: The standard normal distribution is a normal probability distribution that has a mean of 0 and a standard deviation of 1.

Def: The standard normal distribution is a normal probability distribution that has a mean of 0 and a standard deviation of 1. Lecture 6: Chapter 6: Normal Probability Distributions A normal distribution is a continuous probability distribution for a random variable x. The graph of a normal distribution is called the normal curve.

More information


SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES SCHOOL OF HEALTH AND HUMAN SCIENCES Using SPSS Topics addressed today: 1. Differences between groups 2. Graphing Use the s4data.sav file for the first part of this session. DON T FORGET TO RECODE YOUR

More information

AP * Statistics Review. Descriptive Statistics

AP * Statistics Review. Descriptive Statistics AP * Statistics Review Descriptive Statistics Teacher Packet Advanced Placement and AP are registered trademark of the College Entrance Examination Board. The College Board was not involved in the production

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

Module 2: Introduction to Quantitative Data Analysis

Module 2: Introduction to Quantitative Data Analysis Module 2: Introduction to Quantitative Data Analysis Contents Antony Fielding 1 University of Birmingham & Centre for Multilevel Modelling Rebecca Pillinger Centre for Multilevel Modelling Introduction...

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

Data Transforms: Natural Logarithms and Square Roots

Data Transforms: Natural Logarithms and Square Roots Data Transforms: atural Log and Square Roots 1 Data Transforms: atural Logarithms and Square Roots Parametric statistics in general are more powerful than non-parametric statistics as the former are based

More information


ANALYSIS OF TREND CHAPTER 5 ANALYSIS OF TREND CHAPTER 5 ERSH 8310 Lecture 7 September 13, 2007 Today s Class Analysis of trends Using contrasts to do something a bit more practical. Linear trends. Quadratic trends. Trends in SPSS.

More information


Chapter 3 RANDOM VARIATE GENERATION Chapter 3 RANDOM VARIATE GENERATION In order to do a Monte Carlo simulation either by hand or by computer, techniques must be developed for generating values of random variables having known distributions.

More information

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers 1.3 Measuring Center & Spread, The Five Number Summary & Boxplots Describing Quantitative Data with Numbers 1.3 I can n Calculate and interpret measures of center (mean, median) in context. n Calculate

More information



More information

The correlation coefficient

The correlation coefficient The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

More information

Summarizing and Displaying Categorical Data

Summarizing and Displaying Categorical Data Summarizing and Displaying Categorical Data Categorical data can be summarized in a frequency distribution which counts the number of cases, or frequency, that fall into each category, or a relative frequency

More information

Mathematics within the Psychology Curriculum

Mathematics within the Psychology Curriculum Mathematics within the Psychology Curriculum Statistical Theory and Data Handling Statistical theory and data handling as studied on the GCSE Mathematics syllabus You may have learnt about statistics and

More information

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions. Algebra I Overview View unit yearlong overview here Many of the concepts presented in Algebra I are progressions of concepts that were introduced in grades 6 through 8. The content presented in this course

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

Skewed Data and Non-parametric Methods

Skewed Data and Non-parametric Methods 0 2 4 6 8 10 12 14 Skewed Data and Non-parametric Methods Comparing two groups: t-test assumes data are: 1. Normally distributed, and 2. both samples have the same SD (i.e. one sample is simply shifted

More information

Means, standard deviations and. and standard errors

Means, standard deviations and. and standard errors CHAPTER 4 Means, standard deviations and standard errors 4.1 Introduction Change of units 4.2 Mean, median and mode Coefficient of variation 4.3 Measures of variation 4.4 Calculating the mean and standard

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences Introduction to Statistics for Psychology and Quantitative Methods for Human Sciences Jonathan Marchini Course Information There is website devoted to the course at marchini/phs.html

More information

The Normal Distribution

The Normal Distribution Chapter 6 The Normal Distribution 6.1 The Normal Distribution 1 6.1.1 Student Learning Objectives By the end of this chapter, the student should be able to: Recognize the normal probability distribution

More information

Introduction; Descriptive & Univariate Statistics

Introduction; Descriptive & Univariate Statistics Introduction; Descriptive & Univariate Statistics I. KEY COCEPTS A. Population. Definitions:. The entire set of members in a group. EXAMPLES: All U.S. citizens; all otre Dame Students. 2. All values of

More information

Review of Fundamental Mathematics

Review of Fundamental Mathematics Review of Fundamental Mathematics As explained in the Preface and in Chapter 1 of your textbook, managerial economics applies microeconomic theory to business decision making. The decision-making tools

More information

13: Additional ANOVA Topics. Post hoc Comparisons

13: Additional ANOVA Topics. Post hoc Comparisons 13: Additional ANOVA Topics Post hoc Comparisons ANOVA Assumptions Assessing Group Variances When Distributional Assumptions are Severely Violated Kruskal-Wallis Test Post hoc Comparisons In the prior

More information

SPSS Explore procedure

SPSS Explore procedure SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

Statistics. Measurement. Scales of Measurement 7/18/2012

Statistics. Measurement. Scales of Measurement 7/18/2012 Statistics Measurement Measurement is defined as a set of rules for assigning numbers to represent objects, traits, attributes, or behaviors A variableis something that varies (eye color), a constant does

More information

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing! MATH BOOK OF PROBLEMS SERIES New from Pearson Custom Publishing! The Math Book of Problems Series is a database of math problems for the following courses: Pre-algebra Algebra Pre-calculus Calculus Statistics

More information


AP STATISTICS REVIEW (YMS Chapters 1-8) AP STATISTICS REVIEW (YMS Chapters 1-8) Exploring Data (Chapter 1) Categorical Data nominal scale, names e.g. male/female or eye color or breeds of dogs Quantitative Data rational scale (can +,,, with

More information

SAS Analyst for Windows Tutorial

SAS Analyst for Windows Tutorial Updated: August 2012 Table of Contents Section 1: Introduction... 3 1.1 About this Document... 3 1.2 Introduction to Version 8 of SAS... 3 Section 2: An Overview of SAS V.8 for Windows... 3 2.1 Navigating

More information

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices:

Doing Multiple Regression with SPSS. In this case, we are interested in the Analyze options so we choose that menu. If gives us a number of choices: Doing Multiple Regression with SPSS Multiple Regression for Data Already in Data Editor Next we want to specify a multiple regression analysis for these data. The menu bar for SPSS offers several options:

More information

AP Statistics Solutions to Packet 2

AP Statistics Solutions to Packet 2 AP Statistics Solutions to Packet 2 The Normal Distributions Density Curves and the Normal Distribution Standard Normal Calculations HW #9 1, 2, 4, 6-8 2.1 DENSITY CURVES (a) Sketch a density curve that

More information

Algebra 1 Course Information

Algebra 1 Course Information Course Information Course Description: Students will study patterns, relations, and functions, and focus on the use of mathematical models to understand and analyze quantitative relationships. Through

More information


CHAPTER THREE. Key Concepts CHAPTER THREE Key Concepts interval, ordinal, and nominal scale quantitative, qualitative continuous data, categorical or discrete data table, frequency distribution histogram, bar graph, frequency polygon,

More information

Mind on Statistics. Chapter 2

Mind on Statistics. Chapter 2 Mind on Statistics Chapter 2 Sections 2.1 2.3 1. Tallies and cross-tabulations are used to summarize which of these variable types? A. Quantitative B. Mathematical C. Continuous D. Categorical 2. The table

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces

The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces The Effect of Dropping a Ball from Different Heights on the Number of Times the Ball Bounces Or: How I Learned to Stop Worrying and Love the Ball Comment [DP1]: Titles, headings, and figure/table captions

More information

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple.

Pie Charts. proportion of ice-cream flavors sold annually by a given brand. AMS-5: Statistics. Cherry. Cherry. Blueberry. Blueberry. Apple. Graphical Representations of Data, Mean, Median and Standard Deviation In this class we will consider graphical representations of the distribution of a set of data. The goal is to identify the range of

More information

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r),

Chapter 10. Key Ideas Correlation, Correlation Coefficient (r), Chapter 0 Key Ideas Correlation, Correlation Coefficient (r), Section 0-: Overview We have already explored the basics of describing single variable data sets. However, when two quantitative variables

More information