Data Evaluation: Why Important?


Data Evaluation
- Why important?
- Questions answered depend on data collected
- Data format & storage (electronic / hard copies)
- Where to begin with examining the raw data
- Exploratory analysis
- Dealing with the real world (missing values, below detection limit, non-normal vs. normal data, autocorrelation)
- Natural variability (e.g., season, hydrology / meteorology)
- Statistical approaches for assessment and detecting trends
- Data analysis resources in notebook

Data Evaluation: Why Important?
- Get informational value out of the data collected
- Communicate results in a summary format that relates to the questions you want to answer
- Make correct analyses and interpretations of the data
- Make sure everyone is on the same page with respect to the information to be gained from monitoring
- Move future monitoring forward in the direction that best meets objectives

Questions Answered Match Data Collected
For example (frequency; explanatory variables such as flow / stage / rainfall / season / land use; question answered):
- 1 grab sample; explanatory variables: none; question answered: screening for potential follow-up
- Synoptic; explanatory variables: yes (essential); question answered: "snap shot" watershed assessment
- Even interval for multi-years (e.g., biweekly, quarterly); explanatory variables: no; question answered: not much
- Even interval for multi-years (e.g., biweekly, quarterly); explanatory variables: yes; question answered: long-term trends (e.g., adjusted concentrations, biological health)
- Storm water samples; explanatory variables: yes; question answered: loads / watershed assessments (long-term trends if sampling is sustained)

Questions Asked (cont.): Linking Water Quality and Land Treatment / Use
- Watershed experimental design is essential
- Land treatment / land use and water quality monitoring
- Explanatory variables to isolate water quality trends due to BMPs
- Match land treatment (LT) and water quality (WQ) data:
  - Hydrologic (spatial)
  - Time basis (temporal), multi-year

Data Evaluation: Data Format and Storage
- Collect data in a format / layout similar to computer entry (and vice versa), e.g., forms
- Include date, location, time, etc. on ALL records (not just in the file name)
  - Allows for more analysis flexibility
  - Minimizes errors in data identification
- Make unique data fields that can be sorted upon (e.g., site, date, comments / data flags, TP)
- Don't combine fields: an entry such as "<.01" in a single field is not good
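One way to follow the advice against combined fields is to split an entry such as "<.01" into a numeric field and a separate flag field when the data are entered. The sketch below is only illustrative: the function name `split_flagged_value` and the use of "<" and ">" as the only flag characters are assumptions, not something the slides specify.

```python
def split_flagged_value(raw):
    """Split a raw entry such as '<.01' into (numeric value, flag).

    Hypothetical helper: only a leading '<' or '>' is treated as a flag.
    """
    text = raw.strip()
    flag = ""
    if text and text[0] in "<>":
        flag = text[0]
        text = text[1:].strip()
    return float(text), flag


for entry in ["<.01", "> 23000", "0.48"]:
    print(entry, "->", split_flagged_value(entry))
```

Keeping the flag in its own field means the numeric column can still be sorted and summarized directly.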

Example Spreadsheet Data Entry

Date       Site  NO2+3  TKN    TP    TSS   TS    FC      FC_flag  FS      FS_flag
                 mg/l   mg/l   mg/l  mg/l  mg/l  ------ mpn/100ml ------
05-Apr-93  E     1.32   23.00  5.20  194   566   470000           340000  n
13-Apr-93  E     3.29   2.50   0.48  8     138   6400             1600    n
21-Apr-93  E     3.28   1.80   0.33  7     80    17000            3000    n
27-Apr-93  E     3.27   11.00  2.80  67    262   580000           130000  n
04-May-93  E     2.98   1.40   0.55  4     115   66000   >        23000   n
11-May-93  E     3.13   2.80   0.71  10    122   73000   e        18000   e
18-May-93  E     2.88   3.50   0.94  13    140   97000   e        25000   n
25-May-93  E     0.07   37.00  4.20  72    409   930000  e        560000  n
02-Jun-93  E     1.40   4.20   1.18  35    144   200000           68000   >
08-Jun-93  E     1.56   2.30   0.72  21    113   59000            27000   n

Data Evaluation: Data Format and Storage (cont.)
- Build in data entry QA (e.g., allowable minimum / maximum values, character vs. numeric)
- Keep hard copies (remember the card readers, or the CP/M operating system with 8-inch floppies?)
- Have data entry fields for field observations or narrative
- Back-ups, back-ups, back-ups
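A data entry QA check of the kind described above might look like the following minimal sketch. The allowable ranges in `ALLOWED` and the function name `check_entry` are hypothetical placeholders; real limits would come from the project's own QA plan.

```python
# Hypothetical allowable ranges (mg/l); real limits are project-specific.
ALLOWED = {"TP": (0.0, 50.0), "TSS": (0.0, 10000.0)}


def check_entry(field, raw):
    """Return an error message for a bad entry, or None if it passes."""
    try:
        value = float(raw)  # character vs. numeric check
    except ValueError:
        return f"{field}: '{raw}' is not numeric"
    low, high = ALLOWED[field]
    if not low <= value <= high:  # allowable minimum / maximum check
        return f"{field}: {value} is outside the allowable range {low} to {high}"
    return None


for field, raw in [("TP", "0.48"), ("TP", "abc"), ("TSS", "-5")]:
    print(field, raw, "->", check_entry(field, raw))
```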

Data Evaluation: Exploratory Data Analysis
Check for data entry errors:
- Minimum / maximum / average values, to check for exceptionally high / low values ("outliers")
- Box and whisker plots (box plots), to check for exceptionally high / low values and highly skewed data
- Time plots or time series plots: plot data values vs. time to visually examine for unreasonable data
- Skewness tests (e.g., PROC UNIVARIATE in SAS or Data Analysis Tools in Excel)
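As a rough illustration of the first two checks, the sketch below summarizes the TSS column from the example spreadsheet and flags values outside the usual box plot fences. The 1.5 x IQR fence rule is the common box plot convention and is assumed here rather than taken from the slides. Quartile definitions also differ between packages, so flagged values may not match SAS or Excel exactly.

```python
import statistics

# TSS values (mg/l) from the example spreadsheet
tss = [194, 8, 7, 67, 4, 10, 13, 72, 35, 21]


def screen(values, k=1.5):
    """Summarize values and flag those beyond the box plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "flagged": [v for v in values if v < low_fence or v > high_fence],
    }


summary = screen(tss)
print(summary)
```

A flagged value is a prompt to check the field and lab records, not a reason to delete the observation.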

Data Evaluation: Exploratory Data Analysis (cont.)
Check for data distribution attributes:
- Normality tests: test for departure from the normal distribution or "bell-shaped curve" (e.g., PROC UNIVARIATE in SAS or Data Analysis Tools in Excel)
- Skewness tests: test for long tails (e.g., PROC UNIVARIATE in SAS or Data Analysis Tools in Excel)
- Time plots or time series plots: visually examine for seasonality and autocorrelation
- Autocorrelation tests (e.g., PROC AUTOREG in SAS)
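For readers without SAS or Excel at hand, two of these diagnostics can be approximated in a few lines. The sketch below computes a simple moment-based skewness coefficient and a lag-1 autocorrelation coefficient. Both are descriptive statistics rather than formal tests, and the formulas are the basic textbook versions, so the numbers need not match what PROC UNIVARIATE or PROC AUTOREG report.

```python
import statistics

# TSS values (mg/l) from the example spreadsheet, in date order
tss = [194, 8, 7, 67, 4, 10, 13, 72, 35, 21]


def skewness(values):
    """Moment-based skewness; positive means a long right tail."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def lag1_autocorrelation(values):
    """Correlation of each value with the one before it (lag 1)."""
    mean = statistics.fmean(values)
    numerator = sum((a - mean) * (b - mean) for a, b in zip(values, values[1:]))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator


print("skewness:", round(skewness(tss), 2))
print("lag-1 autocorrelation:", round(lag1_autocorrelation(tss), 2))
```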

Real World: Dealing with Outliers
Do you throw them out?
- Perhaps, if and only if you can trace the error back to a data entry, lab, or field QA/QC problem
- Otherwise KEEP them: these may be where the real information is held

Real World: Dealing with Below Detection Limit (DL) Values
- BEST: Use the actual instrument values (which could be negative); this reflects variability and distribution at the lower range. (In hard copy reports, use DL values with a "less than DL" flag.)
- If <20% of values are below the DL: can substitute ½ the value of the DL (e.g., if the DL is 0.01 mg/l, then substitute 0.005 mg/l). BUT, if a value really is 0.01, do not change it (hence the value of a flag variable for DL).
- Else: use an alternative statistical analysis, e.g., frequency analysis
- Else: generate synthetic data that mimics the distribution at the tail
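The half-DL substitution rule can be sketched as follows, assuming each value carries a separate flag field where "<" marks a below-DL result. The function name, the flag convention, and the sample values are illustrative assumptions.

```python
def substitute_half_dl(values, flags, dl, max_fraction=0.20):
    """Replace flagged below-DL values with DL/2 if fewer than 20% are flagged.

    A value equal to the DL but not flagged '<' is a real measurement
    and is left unchanged.
    """
    censored = [flag == "<" for flag in flags]
    fraction = sum(censored) / len(values)
    if fraction >= max_fraction:
        raise ValueError(
            f"{fraction:.0%} of values are below the DL; "
            "use an alternative analysis instead of substitution"
        )
    return [dl / 2 if is_censored else value
            for value, is_censored in zip(values, censored)]


# Illustrative TP values (mg/l) with a DL of 0.01 mg/l
tp = [0.01, 0.05, 0.01, 0.20, 0.30, 0.12]
tp_flags = ["<", "", "", "", "", ""]
print(substitute_half_dl(tp, tp_flags, dl=0.01))
```

Here the first 0.01 is flagged and becomes 0.005, while the third value, a real measurement of 0.01, is kept.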

Real World: Dealing with Missing Values
- BEST: Have sufficient data frequency and use the rest of the data values for analysis
- Substitution, e.g., regression analysis: plot values of TP concentration vs. stream flow. If there is a good correlation, calculate estimated values for missing TP concentrations when discharge is known. USE SPARINGLY
- Aggregation: combine data over time intervals (e.g., weekly averages, annual averages)
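The regression substitution idea can be sketched as below. The data are made up for illustration, and the `min_r2` cutoff used to decide what counts as a "good correlation" is a hypothetical choice, not a value from the slides.

```python
def fit_line(x, y):
    """Ordinary least squares fit; returns slope, intercept, and R-squared."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy ** 2 / (sxx * syy)


def fill_missing(flow, tp, min_r2=0.8):
    """Estimate missing TP (None) from flow, only if the fit is good enough."""
    known = [(q, c) for q, c in zip(flow, tp) if c is not None]
    slope, intercept, r2 = fit_line([q for q, _ in known], [c for _, c in known])
    if r2 < min_r2:
        return list(tp)  # correlation too weak: leave the gaps alone
    return [c if c is not None else intercept + slope * q
            for q, c in zip(flow, tp)]


# Illustrative data: one missing TP value on a day with known discharge
flow = [1.0, 2.0, 3.0, 4.0, 5.0]
tp = [0.1, 0.2, None, 0.4, 0.5]
print(fill_missing(flow, tp))
```

Estimated values should be marked as estimates in the data flag field so they can be excluded later if needed.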

Real World: Dealing with Non-Normality
- Data transformation: Log(X). The log-normal distribution (i.e., the log-transformed data have a normal distribution) is very common for water quality pollutant concentration data. An attribute of such data is that there are a few high values in the tail.
- Utilize non-parametric statistical analyses. However, this doesn't cure all problems.
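A small sketch of the log transformation, using the TSS values from the example spreadsheet: the mean is taken on the log10 scale and then back-transformed, which gives the geometric mean. The function name is an assumption, and the approach only works for strictly positive values.

```python
import math

# TSS values (mg/l) from the example spreadsheet
tss = [194, 8, 7, 67, 4, 10, 13, 72, 35, 21]


def log_scale_summary(values):
    """Return the mean of log10(values) and its back-transform."""
    logs = [math.log10(v) for v in values]  # requires values > 0
    mean_log = sum(logs) / len(logs)
    return mean_log, 10 ** mean_log


mean_log, geometric_mean = log_scale_summary(tss)
arithmetic_mean = sum(tss) / len(tss)
print("arithmetic mean:", arithmetic_mean)
print("geometric mean:", round(geometric_mean, 1))
```

The geometric mean is pulled far less by the few high values in the tail than the arithmetic mean is.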

Parametric vs. Nonparametric

Parametric
- Mean = central tendency
- Symmetrical distribution about the mean (usually normal); log-normal and slightly skewed OK
- Must adjust for:
  - Autocorrelation (easy)
  - Seasonal differences (easy)
  - Variance heterogeneity (doable)
  - Hydrology, flow (easy)
- Versatile, excellent for:
  - Assessments of variability
  - Step trends
  - Linear trends
  - Ramp trends

Nonparametric
- Median = central tendency
- Normality not required; skewed and outlier data OK
- Must adjust for:
  - Autocorrelation (doable)
  - Seasonal differences (easy)
  - Variance heterogeneity (difficult)
  - Hydrology, flow (2 steps)
- Excellent for:
  - Assessments of variability
  - Step trends
  - Monotonic trends

Real World: Dealing with Autocorrelation
- Time series analysis, e.g., PROC AUTOREG in SAS (appropriate for weekly, monthly data)
  - Useful in regression relationships (e.g., time trends, correlation between sites such as paired watersheds or upstream/downstream)
  - Note: spatial autocorrelation analysis methods are available
- Aggregate data: average into larger time steps (e.g., quarterly, annually). The problem is a potential loss of degrees of freedom.
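Aggregating into larger time steps can be sketched as below, here averaging dated observations into calendar quarters. The function name and the sample records are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date


def quarterly_means(records):
    """Average (date, value) records into (year, quarter) means."""
    groups = defaultdict(list)
    for day, value in records:
        quarter = (day.month - 1) // 3 + 1
        groups[(day.year, quarter)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}


# Illustrative records: two samples in Q1 1993, one in Q2 1993
records = [
    (date(1993, 1, 5), 1.0),
    (date(1993, 2, 5), 3.0),
    (date(1993, 4, 5), 5.0),
]
print(quarterly_means(records))
```

Three observations collapse to two quarterly values here, which is the loss of degrees of freedom the slide warns about.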

Real World: Dealing with Seasonality
- Use explanatory variable (covariate) adjustment, with measured variables dealing with hydrologic / meteorological changes, to adjust for seasonal changes, e.g.:
  - TP concentration adjusted for stream discharge by including discharge as an X variable in trend analysis
  - Normalize: e.g., adjusting the load value to the average storm discharge level to allow comparison across storms
- Model seasonal cycles into the analysis, e.g.:
  - Indicator variables ("0" or "1") for each month/season
  - Sinusoidal models
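The two ways of modeling seasonal cycles can be sketched as covariate builders for a regression. The function names are assumptions, as are the choice of January as the reference month and the use of a single annual sine/cosine pair. The resulting columns would be added as X variables alongside time and discharge.

```python
import math


def month_indicators(month):
    """Eleven 0/1 indicator variables for February..December.

    January is the reference month (all zeros), an arbitrary choice.
    """
    return [1 if month == m else 0 for m in range(2, 13)]


def seasonal_terms(day_of_year, period=365.25):
    """Sine and cosine terms for a single annual cycle."""
    angle = 2 * math.pi * day_of_year / period
    return math.sin(angle), math.cos(angle)


print(month_indicators(3))
print(seasonal_terms(91))
```

Indicator variables allow any seasonal shape at the cost of more parameters, while the sinusoidal pair assumes one smooth cycle per year.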

Natural Variability: What's in a MEAN?
- Central tendency
- Good summary statistic
- Doesn't tell the full story
The Fallacy of the Mean
- Doesn't show range or variability
- Hard to show statistically significant differences between mean values without variance
- Non-robust to extremes

Natural Variability: Variability is our Friend
- Use to determine Minimal Detectable Changes (MDC) or differences
- Find the "goods" and the "bads"
- Avoid unrealistic expectations of good or bad conditions
- Recognize that year-to-year variability can be LARGE

Natural Variability
Utilize explanatory variables / covariates (to minimize unexplained variability and assist with making correct data interpretations), such as:
- Land use
- Stream flow / discharge / stage height
- Precipitation
- Ground water table depth
- Temperature
- Season
- Upstream conditions

Waukegan River, Illinois: IBI (e.g., IBG, "I B Guessing")
[Figure: IBI vs. elapsed months; series S1 and S2; labels "Pre", "Treatment", "Post", "Control".]

Statistical Analysis Toolbox
- No witch hunts allowed: pre-planned questions only
- Utilize statistical test(s) that address the questions / objectives (e.g., assessments of central tendency and variability, step change, gradual change)
- Utilize multiple statistical approaches and graphical presentations

Statistical Distribution Assessment
- Box and whisker plots
- Mean & variance / standard deviation
- Median & percentile analysis
- Frequency distribution analysis (e.g., percent of data in the 25th, 50th, and 75th percentiles; percent exceedance of a standard)
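Percent exceedance of a standard is a simple count. The sketch below uses the TP column from the example spreadsheet, with a standard of 1.0 mg/l chosen purely for illustration. It is a hypothetical threshold, not a regulatory value from the slides.

```python
# TP values (mg/l) from the example spreadsheet
tp = [5.20, 0.48, 0.33, 2.80, 0.55, 0.71, 0.94, 4.20, 1.18, 0.72]


def percent_exceedance(values, standard):
    """Percent of observations strictly greater than the standard."""
    return 100.0 * sum(v > standard for v in values) / len(values)


# Hypothetical standard of 1.0 mg/l
print(percent_exceedance(tp, standard=1.0))
```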

BMP Effectiveness: An Example Across Sites / Studies (e.g., multiple watersheds)
[Figure: two panels, "Changes in Sediment Load - Conservation Tillage" and "Changes in Total P Load - Conservation Tillage"; y-axis: % load reduction; each panel shows range and mean (lowest, highest, mean).]

Correlation between variables (e.g., TSS and Turbidity, Long Creek)
[Figure: two panels. "Correlation Between TSS & Turbidity": TSS vs. turbidity, showing Y and predicted Y. "TSS vs. Turbidity, Long Creek, Site E": log(TSS) vs. log(turbidity), showing Y and predicted Y.]

Statistical Approaches: Comparisons Between Locations
Parametric:
- T-test (compare mean values between 2 groups)
- Analysis of variance, ANOVA (compare more than 2 groups)
- Analysis of covariance (addition of an explanatory variable, which can be a continuous variable such as stream flow)
Non-parametric:
- Wilcoxon Rank Sum (~T-test)
- Kruskal-Wallis k-sample (~ANOVA)

Statistical Approaches: Step Trend
Comparison between 2 time periods
Parametric tests:
- T-test (non-paired or paired); paired is usually more powerful
- Analysis of variance
- Analysis of covariance
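As a minimal sketch of the paired comparison, the code below computes the paired t statistic and its degrees of freedom from matched before/after values. The pairing (for example, the same month before and after treatment) and the numbers are illustrative assumptions. The p-value is left out because the Python standard library has no t distribution, so the statistic would be compared against a t table or passed to a statistics package.

```python
import math
import statistics


def paired_t(before, after):
    """Paired t statistic and degrees of freedom for matched samples."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_error = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / std_error, n - 1


# Illustrative matched concentrations (mg/l), before and after treatment
before = [10, 12, 9, 11, 13]
after = [8, 9, 8, 9, 10]
t_stat, df = paired_t(before, after)
print("t =", round(t_stat, 2), "df =", df)
```

Pairing removes the variability shared by both periods, which is why the paired test is usually the more powerful of the two.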

Statistical Approaches: Step Trend
Non-parametric tests:
- Step trend (non-paired):
  - Wilcoxon Rank Sum Test
  - Seasonal Wilcoxon Rank Sum Test
  - Kruskal-Wallis k-sample (compares more than 2 groups, ~Analysis of Variance)
- Step trend (paired differences):
  - Wilcoxon Signed Rank Test

Statistical Approaches: Continuous Trend
Parametric tests:
- Linear regression
  - Add explanatory variables (covariates) where appropriate
  - Can add a dummy variable to mimic a ramp
- Analysis of covariance (e.g., adjustment for upstream concentration or control watershed)
- Time series analysis

Statistical Approaches: Continuous Trend
Non-parametric tests:
- Correlation: Spearman's Rank Correlation (Spearman's rho)
- Monotonic trends:
  - Kendall's tau (Mann-Kendall, Kendall Rank Correlation)
  - Seasonal Kendall Test
  - Flow adjustment is a 2-step process: 1) calculate residuals from a linear regression of concentration vs. discharge; 2) utilize the residuals (adjusted values) in one of the above tests
- Contingency table (e.g., Cochran-Mantel-Haenszel (CMH) statistics)
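The 2-step flow adjustment followed by a Mann-Kendall test can be sketched as below. This is a simplified version under stated assumptions: the variance formula has no correction for tied values, the p-value uses the normal approximation (usually considered reasonable for roughly ten or more observations), and no seasonal blocking is done. The function names and data are illustrative.

```python
import math
from statistics import NormalDist


def flow_adjusted_residuals(flow, conc):
    """Step 1: residuals from a linear regression of concentration on flow."""
    n = len(flow)
    mean_q, mean_c = sum(flow) / n, sum(conc) / n
    sxx = sum((q - mean_q) ** 2 for q in flow)
    sxy = sum((q - mean_q) * (c - mean_c) for q, c in zip(flow, conc))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_q
    return [c - (intercept + slope * q) for q, c in zip(flow, conc)]


def mann_kendall(values):
    """Step 2: Mann-Kendall S, Z, and two-sided p (no tie correction)."""
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = values[j] - values[i]
            s += (diff > 0) - (diff < 0)  # sign of each later-minus-earlier pair
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)  # continuity correction
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return s, z, p


# Illustrative series in time order: concentration follows flow plus a decline
flow = [3.0, 5.0, 2.0, 6.0, 4.0, 5.0, 3.0, 6.0, 2.0, 4.0]
conc = [0.1 * q - 0.02 * t for t, q in enumerate(flow)]
s, z, p = mann_kendall(flow_adjusted_residuals(flow, conc))
print("S =", s, "Z =", round(z, 2), "p =", round(p, 4))
```

A negative S on the residuals indicates a downward trend that remains after the effect of discharge has been removed.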

Long Creek, NC 319 NMP
83%, 76%, 78%, and 33% reductions in sediment, TP, TKN, and nitrate-N loads, respectively (upstream/downstream, before/after design)

Long Creek, NC (see NWQEP NOTES, July 1999, Figure 7 for a SAS program to test for trends in downstream after adjusting for upstream)
[Figure: log weekly TSS load (lbs) vs. week; series: downstream (treatment) and upstream (control); annotation: "... of BMPs".]

Section 319 NMP Projects: Morro Bay, California
4-H Watershed Model, Youth Education