2. Filling Data Gaps, Data validation & Descriptive Statistics




2. Filling Data Gaps, Data validation & Descriptive Statistics Dr. Prasad Modak

Background
Data collected from the field may suffer from these problems:
- Data may contain gaps (no readings during a period)
- Data may exhibit suspect values
Addressing these deficiencies is essential before processing or analyzing the data.

How to fill in data gaps
Identify the data gaps, then use any of the following techniques:
- Use the average of the distribution to fill in data gaps
- Take the mean of adjacent values
- Linear interpolation
- Use a regression model
Remember that there is no substitute for a real measured value. Filling missing values introduces bias and distortion that affect how the data can be interpreted.

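Techniques for Filling Missing Values
- Mean of adjacent values. Suppose x_a and x_c are two values with the value of x_b between them missing. Then: $x_b = \frac{x_a + x_c}{2}$
- Linear interpolation. Suppose x_a and x_b are the two known values adjacent to a run of n missing values. Then the k-th missing value (counting from x_a) is: $x_k = x_a + k \, \frac{x_b - x_a}{n + 1}$. The n missing values divide the gap into n + 1 equal steps, so the values for k = 1, ..., n lie strictly between x_a and x_b.
- Series average. If m data points are missing out of n and m << n, assign the average value of the series to each of the m missing points.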
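The interpolation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the original slides: the function name `fill_gaps_linear` and the use of `None` to mark a missing reading are assumptions. With a single missing value the rule reduces to the mean of the two adjacent values.

```python
from typing import List, Optional


def fill_gaps_linear(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill interior runs of None by linear interpolation between neighbours.

    Hypothetical helper: None marks a missing reading. Gaps at the very
    start or end have only one neighbour and are left unfilled.
    """
    filled = list(series)
    i = 0
    while i < len(filled):
        if filled[i] is None:
            start = i - 1                      # index of x_a (last known value)
            j = i
            while j < len(filled) and filled[j] is None:
                j += 1                         # j ends at index of x_b (next known value)
            if start >= 0 and j < len(filled):
                x_a, x_b = filled[start], filled[j]
                n_missing = j - i
                for k in range(1, n_missing + 1):
                    # k-th missing value: x_a + k * (x_b - x_a) / (n + 1)
                    filled[start + k] = x_a + k * (x_b - x_a) / (n_missing + 1)
            i = j
        else:
            i += 1
    return filled


print(fill_gaps_linear([10.0, None, None, None, 18.0]))
```

Leaving leading and trailing gaps unfilled is a deliberate choice here: with only one known neighbour there is nothing to interpolate between, and one of the other techniques (series average or a regression model) would be needed.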

Outliers
- What are outliers?
- How do outliers affect the analysis of data?
How to find outliers:
- Dixon's 4-sigma (σ) test: outliers are those values that lie beyond the mean ± 4 standard deviations
- Box and whisker diagram
- Use water quality standards (IS 10500:1991)

Example: Dixon's 4-sigma test on hourly hardness readings

Date & Time        Hardness
3/26/2004 13:00    122.3
3/26/2004 14:00    178.6
3/26/2004 15:00    347.4
3/26/2004 18:00    368.3
3/26/2004 19:00    67942
3/26/2004 20:00    22175.7
3/26/2004 21:00    5875.2
3/26/2004 22:00    1840.4
3/26/2004 23:00    840.6
3/27/2004 0:00     643
3/27/2004 1:00     464.8
3/27/2004 2:00     275.2
3/27/2004 3:00     194.6
3/27/2004 4:00     168.2
3/27/2004 5:00     162.1
3/27/2004 6:00     162
3/27/2004 7:00     172.4
3/27/2004 8:00     298.8
3/27/2004 9:00     334
3/27/2004 10:00    398.3
3/27/2004 11:00    394.8
3/27/2004 12:00    480.9

Statistic    Value
µ            4719.98
σ            14893.7
µ + 4σ       64294.8
µ - 4σ       -54855
Q1           182.6
Q3           602.475
Max          67942 (Dixon's test: above µ + 4σ)
Min          122.3

[Figure: Box and Whisker Diagram. y-axis: hardness (mg/l), 0 to 1800; x-axis: Locations A, B, C]

Time    A       B      C
t       313     202    245
t+1     356     220    123
t+2     298     312    332
t+3     1200    111    321
t+4     200     20     234
t+5     398     450    234
t+6     512     367    567
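The 4-sigma screening on the hardness readings can be reproduced with Python's standard `statistics` module. This is a sketch under one assumption: that σ is the sample standard deviation (n - 1 denominator), which is what reproduces the σ value quoted in the example. The variable names are illustrative only.

```python
import statistics

# Hourly hardness readings from the example, in time order.
hardness = [
    122.3, 178.6, 347.4, 368.3, 67942.0, 22175.7, 5875.2, 1840.4,
    840.6, 643.0, 464.8, 275.2, 194.6, 168.2, 162.1, 162.0,
    172.4, 298.8, 334.0, 398.3, 394.8, 480.9,
]

mean = statistics.mean(hardness)
sd = statistics.stdev(hardness)      # sample SD (assumption: n - 1 denominator)
upper = mean + 4 * sd
lower = mean - 4 * sd

# Values beyond mean +/- 4 SD are flagged as suspect.
outliers = [x for x in hardness if x < lower or x > upper]

print(f"mean = {mean:.2f}, sd = {sd:.1f}")
print(f"limits = [{lower:.1f}, {upper:.1f}]")
print("flagged:", outliers)
```

Note that the extreme reading itself inflates both the mean and the standard deviation, so the next-largest value (22175.7) stays inside the limits even though it is far above the rest of the series. Re-running the test after removing a flagged value, or cross-checking with a box and whisker diagram, is one way to catch such cases.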

Control Chart: daily pH against limits as per IS 10500:1991

Day          pH      Lower limit    Upper limit
10-Feb-09    6.92    6.5            8.5
11-Feb-09    6.89    6.5            8.5
12-Feb-09    5.21    6.5            8.5
13-Feb-09    6.34    6.5            8.5
14-Feb-09    6.40    6.5            8.5
15-Feb-09    7.11    6.5            8.5
16-Feb-09    6.90    6.5            8.5
17-Feb-09    7.21    6.5            8.5
18-Feb-09    6.98    6.5            8.5
19-Feb-09    8.10    6.5            8.5
20-Feb-09    6.23    6.5            8.5
21-Feb-09    8.11    6.5            8.5
22-Feb-09    6.54    6.5            8.5
23-Feb-09    6.66    6.5            8.5
24-Feb-09    7.32    6.5            8.5
25-Feb-09    7.90    6.5            8.5
26-Feb-09    7.21    6.5            8.5
27-Feb-09    6.09    6.5            8.5
28-Feb-09    7.54    6.5            8.5
1-Mar-09     5.00    6.5            8.5
2-Mar-09     9.21    6.5            8.5
3-Mar-09     7.81    6.5            8.5
4-Mar-09     8.20    6.5            8.5
5-Mar-09     1.20    6.5            8.5
6-Mar-09     6.87    6.5            8.5
7-Mar-09     7.24    6.5            8.5
8-Mar-09     7.91    6.5            8.5
9-Mar-09     8.19    6.5            8.5
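Validating readings against a standard amounts to a simple range check. The sketch below flags the pH readings that fall outside the 6.5 to 8.5 band used in the control chart; the names `PH_LOWER`, `PH_UPPER` and `violations` are illustrative assumptions, and a flagged value is only a candidate for review, not proof of a bad measurement.

```python
PH_LOWER, PH_UPPER = 6.5, 8.5   # limits used in the control chart above

# (day, pH) pairs from the control chart data.
readings = [
    ("10-Feb-09", 6.92), ("11-Feb-09", 6.89), ("12-Feb-09", 5.21),
    ("13-Feb-09", 6.34), ("14-Feb-09", 6.40), ("15-Feb-09", 7.11),
    ("16-Feb-09", 6.90), ("17-Feb-09", 7.21), ("18-Feb-09", 6.98),
    ("19-Feb-09", 8.10), ("20-Feb-09", 6.23), ("21-Feb-09", 8.11),
    ("22-Feb-09", 6.54), ("23-Feb-09", 6.66), ("24-Feb-09", 7.32),
    ("25-Feb-09", 7.90), ("26-Feb-09", 7.21), ("27-Feb-09", 6.09),
    ("28-Feb-09", 7.54), ("1-Mar-09", 5.00), ("2-Mar-09", 9.21),
    ("3-Mar-09", 7.81), ("4-Mar-09", 8.20), ("5-Mar-09", 1.20),
    ("6-Mar-09", 6.87), ("7-Mar-09", 7.24), ("8-Mar-09", 7.91),
    ("9-Mar-09", 8.19),
]

# Keep every reading that lies outside the permitted band.
violations = [(day, ph) for day, ph in readings
              if not (PH_LOWER <= ph <= PH_UPPER)]

for day, ph in violations:
    side = "below lower limit" if ph < PH_LOWER else "above upper limit"
    print(f"{day}: pH {ph:.2f} is {side}")
```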

Descriptive statistics
- No. of observations (n)
- No. of missing values
- Minimum
- Maximum
- Range
- 1st Quartile
- Median
- 3rd Quartile
- Sum (if relevant)
- Mean (µ)
- Standard error (SE)
- Standard deviation (σ)
- Variance (v = σ²)
- Skewness
- Kurtosis (k)

Arithmetical mean
Mean is the arithmetic mean, or average, of the data. It is calculated using the following formula:
Mean = (Sum of data) / (Total number of observations)
Why is the mean important?
- The mean is the most widely used summary measure of a distribution
- Measures of dispersion (SE, SD, CV) are calculated from the mean
Why might the mean not always be a good measure of the data?
- Two distributions with the same mean can have widely different ranges
- The mean may not reflect the typical value of the distribution when the distribution is highly skewed

Median, quartiles and percentiles
A quartile is any of the three values that divide the sorted data set into four equal parts, so that each part represents one fourth of the sampled population:
- Q1 = first quartile (25% of the data points in the sample are below this value)
- Q2 = second quartile, or median (50% of the data points in the sample are below this value)
- Q3 = third quartile (75% of the data points in the sample are below this value)
With an odd number of observations, the median is an actual value located in the sample.
The sample mean and median can be different.
Similarly, percentiles divide the population into 100 equal parts.
So, median = 2nd quartile = 50th percentile.

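Median
For data sorted in ascending order, $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$:
- n odd: the median is the value at position $\frac{n+1}{2}$, i.e. $x_{med} = x_{\left(\frac{n+1}{2}\right)}$
- n even: the median falls at position $\frac{n}{2}$, between two values, and is conventionally taken as their average, $x_{med} = \frac{1}{2}\left(x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}\right)$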


Standard Error and Standard Deviation
- A Standard Error (SE) is an estimate of the variation of a statistic.
- The SE of the mean tells how much the sample mean would vary from sample to sample, that is, how precisely it estimates the population mean (µ).
- The Standard Deviation (SD) is a widely used measure of the variability or dispersion of the observations (x) about the mean.
- It shows how much variation there is from the average, measuring how far the data values lie from the mean.
- Chebyshev's rule: for any distribution, and for any k > 1, the proportion of the data that lies within k standard deviations of the mean is at least: $p = 1 - \frac{1}{k^2}$
Source: http://www.ltcconline.net/greenl/courses/201/descstat/mean.htm
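As a quick worked illustration of Chebyshev's rule, the bound can be evaluated for a few values of k. The choice of k = 2, 3, 4 is only for illustration; k = 4 corresponds to the ± 4 standard deviation limits of the 4-sigma outlier test.

```latex
\begin{align*}
k = 2:&\quad p \ge 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = 0.75 \\
k = 3:&\quad p \ge 1 - \frac{1}{3^2} = 1 - \frac{1}{9} \approx 0.889 \\
k = 4:&\quad p \ge 1 - \frac{1}{4^2} = 1 - \frac{1}{16} = 0.9375
\end{align*}
```

So for any distribution, at least 75% of the data lie within 2 standard deviations of the mean, and at least 93.75% within 4. These are lower bounds; most real data sets are more concentrated than the bound requires.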

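How to calculate SE & SD
Sample standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Sample standard error of the mean: $SE = \frac{\sigma}{\sqrt{n}}$
Here $\bar{x}$ is the sample mean and n is the number of observations.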

Example

Day          pH      x-µ      (x-µ)²
10-Feb-09    6.92     0.02     0.00
11-Feb-09    6.89    -0.01     0.00
12-Feb-09    5.21    -1.69     2.87
13-Feb-09    6.34    -0.56     0.32
14-Feb-09    6.40    -0.50     0.25
15-Feb-09    7.11     0.21     0.04
16-Feb-09    6.90     0.00     0.00
17-Feb-09    7.21     0.31     0.09
18-Feb-09    6.98     0.08     0.01
19-Feb-09    8.10     1.20     1.43
20-Feb-09    6.23    -0.67     0.45
21-Feb-09    8.11     1.21     1.46
22-Feb-09    6.54    -0.36     0.13
23-Feb-09    6.66    -0.24     0.06
24-Feb-09    7.32     0.42     0.17
25-Feb-09    7.90     1.00     0.99
26-Feb-09    7.21     0.31     0.09
27-Feb-09    6.09    -0.81     0.66
28-Feb-09    7.54     0.64     0.41
1-Mar-09     5.00    -1.90     3.62
2-Mar-09     9.21     2.31     5.32
3-Mar-09     7.81     0.91     0.82
4-Mar-09     8.20     1.30     1.68
5-Mar-09     1.20    -5.70    32.53
6-Mar-09     6.87    -0.03     0.00
7-Mar-09     7.24     0.34     0.11
8-Mar-09     7.91     1.01     1.01
9-Mar-09     8.19     1.29     1.66
Sum of (x-µ)²                 56.20

n = 28
Var   2.08
SD    1.442732
SE    0.272651

Statistic     Value
count (n)     28
Max           9.21
Min           1.2
Range         1.2 to 9.21
Mean          6.90
Q1            6.51
Median        7.05
Q3            7.83
Variance      2.08
Std. error    0.27
Std. dev      1.44
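The summary figures in this example can be recomputed with Python's standard `statistics` module. This is a sketch, not the tool used to produce the slide: the `inclusive` quartile method is an assumption, chosen because it reproduces the Q1 and Q3 values shown in the example.

```python
import math
import statistics

# Daily pH readings, 10-Feb-09 to 9-Mar-09, in date order.
ph = [
    6.92, 6.89, 5.21, 6.34, 6.40, 7.11, 6.90, 7.21, 6.98, 8.10,
    6.23, 8.11, 6.54, 6.66, 7.32, 7.90, 7.21, 6.09, 7.54, 5.00,
    9.21, 7.81, 8.20, 1.20, 6.87, 7.24, 7.91, 8.19,
]

n = len(ph)
mean = statistics.mean(ph)
variance = statistics.variance(ph)   # sample variance, n - 1 denominator
sd = statistics.stdev(ph)            # sample standard deviation
se = sd / math.sqrt(n)               # standard error of the mean
median = statistics.median(ph)       # averages the two middle values when n is even

# Quartiles; method="inclusive" is an assumption that matches the example's Q1/Q3.
q1, q2, q3 = statistics.quantiles(ph, n=4, method="inclusive")

print(f"n={n}  min={min(ph)}  max={max(ph)}")
print(f"mean={mean:.2f}  median={median:.2f}  Q1={q1:.2f}  Q3={q3:.2f}")
print(f"variance={variance:.2f}  SD={sd:.6f}  SE={se:.6f}")
```

Different software packages use different quartile conventions, so Q1 and Q3 can differ slightly between tools even on identical data; the mean, variance, SD and SE do not depend on that choice.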

Discussion Points
Mean, SE and compliance with standards
- The annual average DO concentration at a station was found to be 4.5 mg/l, with an SE of 1.0. Is the station compliant?
SE and number of samples (sampling frequency)
- Sampling frequency was increased to weekly. What will now happen to the annual average and the SE?
Coefficient of Variation (CV) = SD / Mean
- What does the CV signify? What inferences could we draw?
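To support the discussion, the two quantities can be computed directly. The sketch below uses the SD, mean and n from the pH example for the CV, and then shows how the SE would change with sample size if the SD stayed the same. The sample counts of 12 (monthly) and 52 (weekly) are assumptions made for illustration; the original does not state the earlier sampling frequency.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: SD divided by the square root of n."""
    return sd / math.sqrt(n)


def coefficient_of_variation(sd: float, mean: float) -> float:
    """CV = SD / Mean, a unitless measure of relative variability."""
    return sd / mean


# Values taken from the pH example (SD, mean, number of observations).
sd, mean, n = 1.442732, 6.90, 28
print(f"CV = {coefficient_of_variation(sd, mean):.3f}")
print(f"SE (n={n}) = {standard_error(sd, n):.6f}")

# Hypothetical comparison: same SD, monthly (12) vs weekly (52) samples per year.
se_monthly = standard_error(sd, 12)
se_weekly = standard_error(sd, 52)
print(f"SE monthly = {se_monthly:.3f}, SE weekly = {se_weekly:.3f}")
print(f"ratio weekly/monthly = {se_weekly / se_monthly:.3f}")
```

Under the assumption that the SD is unchanged, more frequent sampling shrinks the SE in proportion to 1/√n, while the annual average itself is only estimated more precisely and has no systematic reason to move. In practice the SD may also change once more samples are taken.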