Module 4: Data Exploration



Similar documents
STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

Module 5: Statistical Analysis

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

Lecture 1: Review and Exploratory Data Analysis (EDA)

Data Exploration Data Visualization

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

3: Summary Statistics

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Descriptive Statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

Assignment #03: Time Management with Excel

Northumberland Knowledge

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Statistics Revision Sheet Question 6 of Paper 2

Using SPSS, Chapter 2: Descriptive Statistics

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Geostatistics Exploratory Analysis

Exploratory data analysis (Chapter 2) Fall 2011

Introduction to Statistics for Psychology. Quantitative Methods for Human Sciences

Data exploration with Microsoft Excel: univariate analysis

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Exploratory Data Analysis

EXPLORING SPATIAL PATTERNS IN YOUR DATA

Descriptive Statistics: Summary Statistics

MEASURES OF VARIATION

AP * Statistics Review. Descriptive Statistics

4 Other useful features on the course web page. 5 Accessing SAS

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Chapter 2 Data Exploration

Exploratory Data Analysis. Psychology 3256

COMPARISON MEASURES OF CENTRAL TENDENCY & VARIABILITY EXERCISE 8/5/2013. MEASURE OF CENTRAL TENDENCY: MODE (Mo) MEASURE OF CENTRAL TENDENCY: MODE (Mo)

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

CALCULATIONS & STATISTICS

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER

Measures of Central Tendency and Variability: Summarizing your Data for Others

WEEK #22: PDFs and CDFs, Measures of Center and Spread

A Picture Really Is Worth a Thousand Words

Interpreting Data in Normal Distributions

Dongfeng Li. Autumn 2010

ADD-INS: ENHANCING EXCEL

Expression. Variable Equation Polynomial Monomial Add. Area. Volume Surface Space Length Width. Probability. Chance Random Likely Possibility Odds

Bellwork Students will review their study guide for their test. Box-and-Whisker Plots will be discussed after the test.

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Using Excel for descriptive statistics

How To Write A Data Analysis

consider the number of math classes taken by math 150 students. how can we represent the results in one number?

Diagrams and Graphs of Statistical Data

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Means, standard deviations and. and standard errors

MEASURES OF CENTER AND SPREAD MEASURES OF CENTER 11/20/2014. What is a measure of center? a value at the center or middle of a data set

1 Descriptive statistics: mode, mean and median

Statistics. Measurement. Scales of Measurement 7/18/2012

First Midterm Exam (MATH1070 Spring 2012)

GeoGebra Statistics and Probability

Description. Textbook. Grading. Objective

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Exercise 1.12 (Pg )

How To Test For Significance On A Data Set

LMS User Manual LMS Grade Book NUST LMS

Descriptive Statistics

Module 2: Introduction to Quantitative Data Analysis

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

MBA 611 STATISTICS AND QUANTITATIVE METHODS

CHAPTER THREE COMMON DESCRIPTIVE STATISTICS COMMON DESCRIPTIVE STATISTICS / 13

Introduction to Environmental Statistics. The Big Picture. Populations and Samples. Sample Data. Examples of sample data

Module 3: Refining and Retrieving Data

Foundation of Quantitative Data Analysis

9.1 Measures of Center and Spread

Intro to Statistics 8 Curriculum

Week 11 Lecture 2: Analyze your data: Descriptive Statistics, Correct by Taking Log

a. mean b. interquartile range c. range d. median

Variables. Exploratory Data Analysis

2. Here is a small part of a data set that describes the fuel economy (in miles per gallon) of 2006 model motor vehicles.

Directions for Frequency Tables, Histograms, and Frequency Bar Charts

Chapter 1: Exploring Data

430 Statistics and Financial Mathematics for Business

GeoGebra. 10 lessons. Gerrit Stols

Section 1.3 Exercises (Solutions)

Introduction to Exploratory Data Analysis

An introduction to using Microsoft Excel for quantitative data analysis

Summarizing and Displaying Categorical Data

Mean = (sum of the values / the number of the value) if probabilities are equal

Introduction to Quantitative Methods

2 Describing, Exploring, and

A Correlation of. to the. South Carolina Data Analysis and Probability Standards

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

IBM SPSS Statistics 20 Part 1: Descriptive Statistics

How Does My TI-84 Do That

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Fathom Dynamic Statistics Software Tutorial. McGraw-Hill Ryerson Mathematics of Data Management

How To Check For Differences In The One Way Anova

4. Descriptive Statistics: Measures of Variability and Central Tendency

Transcription:

Module 4: Data Exploration

Now that you have your data downloaded from the Streams Project database, the detective work can begin! Before computing any advanced statistics, we will first use descriptive statistics to examine the distribution of your data. The following topics are covered in this module: 1) Overview of descriptive statistics 2) Central tendency 3) Dispersion 4) Visualizing your data

Count Data Exploration 1) Overview of Data Exploration Calculating the descriptive statistics outlined in this module may be the extent of your analysis or the first step towards a more in-depth analysis as outlined in Module 5. Descriptive statistics help you describe your data in terms of its distribution. To examine your data s distribution you will need a measure of central tendency and dispersion. These measures are illustrated in the frequency histogram below which shows the count of data for each interval of 1 MPN (Most Probably Number) of E.coli helping you visualize the frequency of data occurring over the range of values in the dataset: The number of E.coli values that fall into a given range of MPN values 8 7 6 5 4 3 2 Central tendency measurements describe the numeric center of the data. 1 The dispersion of the data describes the spread of the data around the data s center 1 2 3 4 5 E.coli in MPN (Most Probably Number) Range of MPN values that define each interval * A histogram helps you visualize the frequency of values in your dataset occurring over a series of defined intervals. We will show you how to create a histogram later in the module.

Count Count Count Data Exploration 1) Overview of Data Exploration Before we calculate measure of central tendency and dispersion, let s look at what we mean by distribution. The ideal distribution of data is called the normal distribution. For a normal distribution, all measures of central tendency (you will see there are several!) are the same, and there is an equal number of observed data points on either side of these measures of central tendency. A histogram is a great way of initially visualizing your data s distribution because you can get a sense of the central tendency and dispersion of data around that center before calculating any statistics. The following histograms illustrate normal distributions as well as non-normal, or skewed, distributions for comparison: 8 7 6 5 4 3 2 1 1 2 3 4 5 E.coli in MPN Data is equally dispersed on either side of the data s center (peak of the histogram) 8 7 6 5 4 3 2 1 1 2 3 4 5 E.coli in MPN There is more data on the lower end of the data range with values tapering to the higher end 8 7 6 5 4 3 2 1 1 2 3 4 5 E.coli in MPN There is more data on the higher end of the data range with values tapering to the lower end Normal Distribution Left-Skewed Right-Skewed For advances statistical tests, it is important to determine if your distribution is normal or otherwise, as this will affect the type of statistical test you chose to use. The statistical tests in this tutorial assume your data is normally distributed.

2) Central tendency Ambrose et al. (22:22) describe central tendency as what usually happens If we measure E.coli at all our stream sites, what amount of E.coli do we usually measure? Having a measurement that reflects what usually happens allows us to compare individual sample data points to the usual value of data observed represented by a measure of central tendency. The following are three measurements of central tendency: Mean = the sum of observed data points divided by the number of data records Median = the middle data point (or average of the two middle data points if there is an even number of observations) when all data points are lined up in either ascending or descending order Mode = The most frequently occurring data point value in a dataset On the next page you will see examples of how these three measurements would be calculated for a sample E.coli dataset. Continued

2) Central tendency Data Point # E.Coli (MPN) 1 41 2 263 3 31 4 476 5 388 6 417 7 345 8 42 9 379 1 379 Sample dataset of E.coli values with units MPN Given our sample dataset, we would calculate the mean, median, and mode as follows: Mean = (41+263+31+476+388+417+345+42+379+381)/1 = 377 Median = 263, 31, 345, 379, 381, 388, 42, 41, 417, 476 > (381+388)/2 = Watch an additional example in Excel 385 Mode = 379 which occurs twice while other values occur only once 379 These values all suggest that the center of our data, or the most frequent values of E.coli measured in our streams, falls around 37 39 MPN. Watch and additional example in Excel Watch an additional example in Excel Continued

Frequency Frequency Frequency Frequency Frequency Data Exploration 3) Dispersion We have a measure of central tendency for our sample dataset (let s use the mean), but how do we describe how the rest of the data falls around our mean? If the following bell curves (think of them as smoothed out histograms) represent data with the centerline as your measure of central tendency, how do we describe the different ways that the data falls around their centerline? E.coli (MPN) E.coli (MPN) E.coli (MPN) E.coli (MPN) E.coli (MPN) Is the data distributed unevenly about the mean, with a few values far from the mean in one direction? Is the data distributed evenly on both side of the mean but concentrated around on mean? Is the data distributed evenly on both side of the mean with most data within a standard distance? Is the data distributed evenly on both sides of the mean dispersed over a greater range without much concentration by the mean? Is the data distributed unevenly about the mean, with a few values far from the mean in one direction? Now that you have a visual on what we mean by dispersion, the following page helps you calculate statistics used to quantify the nature of the dispersion around the mean. Continued

3) Dispersion The following are measurements of dispersion used to quantify the spread of data about the data s center: Range = The highest measurement the lowest measurement in the dataset Variance = A cumulative measure of individual data points distance from the mean. The following equation is used to calculate variance: Where: s 2 = x 2 ( x)2 n n 1 means you add everything that follows together. Remember to pay attention to parentheses they re important! s² = variance x = individual values in of the dataset n = the number of data points in the dataset Continued

3) Dispersion Standard Deviation = Somewhat of an average deviation of the data from the mean. Standard deviation is calculated as the square root of the variance: Where: s = x 2 ( x)2 n n 1 s = standard deviation x = individual values in of the dataset n = the number of data points in the dataset means you add everything that follows together. Remember to pay attention to parentheses they re important! On the next page you will see examples of how these three measurements would be calculated for our sample E.coli dataset. Continued

3) Dispersion Data Point # E.Coli (MPN) 1 41 2 263 3 31 4 476 5 388 6 417 7 345 8 42 9 379 1 379 Sample dataset of E.coli values with units MPN s 2 = s = Given our sample dataset, we would calculate the following values for range, variance, and standard deviation: Range = 476 (highest value) 263 (lowest value) = 213 Variance = 41 2 + 263 2 + 31 2 ( 41 + 263 + 31 )2 1 1 1 Standard deviation = 41 2 + 263 2 + 31 2 ( 41 + 263 + 31 )2 1 1 1 = 3528.1 = 59.4 Watch an additional example in Excel Watch an additional example in Excel Watch and additional example in Excel The standard deviation, variance and mean are all metrics commonly used in more advanced statistical tests, some of which will be described in Module 5.

4) Visualizing your data In the first part of this module you learned how to calculate two types of parameters to help you describe the distribution of your data: Measure of central tendency Dispersion When you are presenting these parameters, it is helpful to provide a visual of your data s distribution in addition the numbers. The remainder of this module will help you create a histogram and/or a boxplot depicting your data s distribution. The module will conclude by helping you describe your combined visual and statistical results.

Count Data Exploration 4) Visualizing your data First, we will revisit our frequency histogram, which shows the frequency of values in your dataset over a series of intervals that covers the range of your dataset. See the link below to see how to create a histogram for your data in excel. 4 3 E.colincy of E.coli Measurements A histogram illustrates the central tendency and the dispersion of a dataset by showing the frequency of data over a range of values The number of E.coli values that fall into a given range of MPN values 2 1 25-299 3-349 35-399 4-449 45-499 E.coli in MPN Range of MPN values that define each interval Click on the video icon to watch a video on how to create a histogram using Microsoft Excel

Phosphours (µg/l) Data Exploration 4) Visualizing your data A box plot also illustrates the distribution of your data. A box plots is made up of the following values derived from your dataset: median, minimum value, maximum value, quartile 1 value, and quartile 2 values. The following graph illustrates these components with descriptions following: 9 8 7 6 Maximum Value Quartile 3 5 4 3 2 1 = IQR: Q3 Q1 Median Quartile 1 Minimum Value Forested Site Definitions: Maximum Value = the largest value in the dataset Minimum Value = the smallest value in the dataset Median = the middle value of the dataset when all values are lined up either ascending or descending Quartile 1 = the median value of all values less than and excluding the median of the entire dataset Quartile 3 = the median value of all values greater than and excluding the median of the entire dataset Inter-Quartile Range (IQR) = Quartile 3 Quartile 1 Click on the video icon to watch a video on how to create a box-plot using Microsoft Excel

Box Plot Phosphours (µg/l) Phosphours (µg/l) Phosphours (µg/l) Histogram Count Count Count Data Exploration 4) Visualizing your Data Remember to discuss whether or not your data is normally distributed, or perhaps is skewed to the left or right. Your graphs, in addition to your descriptive statistics, help you communicate your findings, so be sure to include both! 8 7 6 5 4 3 2 1 1 2 3 4 5 8 7 6 5 4 3 2 1 1 2 3 4 5 8 7 6 5 4 3 2 1 1 2 3 4 5 E.coli in MPN E.coli in MPN E.coli in MPN Data is equally dispersed on either side of the data s center (peak of the histogram) 6 5 4 There is more data on the lower end of the data range with values tapering to the higher end 6 5 4 6 5 4 There is more data on the higher end of the data range with values tapering to the lower end 3 3 3 2 2 2 1 1 1 Normal Distribution Left-Skewed Right-Skewed Data is equally dispersed on either side of the median (center line of the box) The median is visibly closer two one the lower end of values The median is visibly closer two one the higher end of values Normal Distribution Left-Skewed Right-Skewed

It is often the case that after exploring your data with descriptive statistics you want to modify or refine your central question that s fine!

SUMMARY Descriptive statistics help describe your data s distribution A measure of central tendency and dispersion are needed to describe your data s distribution statistically Ideally your data fits the descriptions of a normal distribution with data distributed evenly on either side of the measure of central tendency. The following are measures of central tendency: mean, median and mode The following are measure of dispersion: range, variance, and standard deviation Histograms and box plots can help you illustrate your data s distribution Your descriptive statistics, histograms and/or box plots together help you describe the nature of your data After exploring your data using descriptive statistics it s good to reflect on your question and modify or refine it as needed.