Data Evaluation: Outline
- Why important? Questions answered depend on data collected
- Data format & storage (electronic / hard copies)
- Where to begin with examining the raw data
- Exploratory analysis
- Dealing with the real world (missing values, below detection limit, non-normal data, autocorrelation)
- Natural variability (e.g., season, hydrology / meteorology)
- Statistical approaches for assessment and detecting trends
- Data analysis resources in notebook
Data Evaluation: Why Important?
- Get informational value out of the data collected
- Communicate results in a summary format that relates to the questions you want to answer
- Make correct analyses and interpretations of the data
- Make sure everyone is on the same page regarding the information to be gained from monitoring
- Move future monitoring forward in the direction that best meets objectives
Questions Answered Must Match Data Collected. For example:

Frequency                                                  | Explanatory variables? (e.g., flow / stage / rainfall / season / land use) | Question answered
1 grab sample                                              | None            | Not much
Synoptic                                                   | No              | Screening for potential follow-up; snapshot watershed assessment
Even interval for multi-years (e.g., biweekly, quarterly)  | Yes (essential) | Long-term trends (e.g., adjusted concentrations, biological health)
Storm water samples                                        | Yes             | Loads / watershed assessments (long-term trends if sampling sustained)
Questions Asked (con't): Linking Water Quality and Land Treatment / Use
- Watershed experimental design is essential
- Land treatment / land use and water quality monitoring
- Explanatory variables to isolate water quality trends due to BMPs
- Match land treatment and water quality data: hydrologic (spatial) and time basis (temporal), multi-year
Data Evaluation: Data Format and Storage
- Collect data in a format / layout similar to computer entry (and vice versa), e.g., forms
- Include date, location, time, etc. on ALL records (not just in the file name)
  - Allows for more analysis flexibility
  - Minimizes errors in data identification
- Make unique data fields that can be sorted upon (e.g., site, date, comments / data flags, TP)
- Don't combine fields: "<.01" in a single field is not good
Example Spreadsheet Data Entry

Date       Site  NO2+3  TKN    TP    TSS   TS    FC         FC_flag  FS         FS_flag
                 mg/l   mg/l   mg/l  mg/l  mg/l  mpn/100ml  ---      mpn/100ml  ---
05-Apr-93  E     1.32   23.00  5.20  194   566   470000              340000     n
13-Apr-93  E     3.29   2.50   0.48  8     138   6400                1600       n
21-Apr-93  E     3.28   1.80   0.33  7     80    17000               3000       n
27-Apr-93  E     3.27   11.00  2.80  67    262   580000              130000     n
04-May-93  E     2.98   1.40   0.55  4     115   66000      >        23000      n
11-May-93  E     3.13   2.80   0.71  10    122   73000      e        18000      e
18-May-93  E     2.88   3.50   0.94  13    140   97000      e        25000      n
25-May-93  E     0.07   37.00  4.20  72    409   930000     e        560000     n
02-Jun-93  E     1.40   4.20   1.18  35    144   200000              68000      >
08-Jun-93  E     1.56   2.30   0.72  21    113   59000               27000      n
Data Evaluation: Data Format and Storage (con't)
- Build in data-entry QA (e.g., allowable minimum / maximum values; character vs. numeric fields)
- Keep hard copies (remember card readers, or the CP/M operating system and its 8 floppies)
- Have data-entry fields for field observations or narrative
- Back-ups, back-ups, back-ups
Data Evaluation: Exploratory Data Analysis
Check for data entry errors:
- Minimum / maximum / average values, to check for exceptionally high / low values ("outliers")
- Box-and-whisker plots (box plots), to check for exceptionally high / low values and highly skewed data
- Time plots (time series plots): plot data values vs. time to visually examine for unreasonable data
- Skewness tests (e.g., PROC UNIVARIATE in SAS or Data Analysis Tools in Excel)
Data Evaluation: Exploratory Data Analysis (con't)
Check for data distribution attributes:
- Normality tests: test for departure from the normal distribution, or "bell-shaped" curve (e.g., PROC UNIVARIATE in SAS or Data Analysis Tools in Excel)
- Skewness tests: test for long tails (same tools)
- Time plots (time series plots): visually examine for seasonality, autocorrelation
- Autocorrelation tests (e.g., PROC AUTOREG in SAS)
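The range and outlier screens above can be sketched in Python. The concentrations below are hypothetical (including a deliberate decimal-slip outlier), and the 1.5 x IQR fence is the standard box-plot convention rather than anything the slides prescribe:

```python
import statistics

# Hypothetical weekly TP concentrations (mg/l); 52.0 mimics a decimal-point
# data-entry slip (should have been 5.20)
tp = [0.33, 0.48, 0.55, 0.71, 0.72, 0.94, 1.18, 2.80, 4.20, 52.0]

# Quick min / max / mean check for unreasonable values
print(min(tp), max(tp), round(statistics.mean(tp), 2))

# Box-plot style screen: flag points beyond 1.5 * IQR outside the quartiles
q1, q2, q3 = statistics.quantiles(tp, n=4)
iqr = q3 - q1
flagged = [x for x in tp if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(flagged)
```

Flagged points are candidates for tracing back to the original field or lab record, not automatic deletions.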
Real World: Dealing with Outliers
Do you throw them out? Only if you can trace the error back to a data entry, lab, or field QA/QC problem. Otherwise, KEEP them: these may be where the real information is held.
Real World: Dealing with Below-Detection-Limit Values
- BEST: Use the actual instrument values (which could be negative); this reflects variability and distribution at the lower range. (In hard-copy reports, use DL values with a "less than DL" flag.)
- If <20% of values are below DL: can substitute 1/2 the DL value (e.g., if DL is 0.01 mg/l, substitute 0.005 mg/l). BUT if the value really is 0.01, do not change it (use a flag variable for DL).
- Else: use an alternative statistical analysis, e.g., frequency analysis
- Else: generate synthetic data that mimics the distribution at the tail
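A sketch of the half-DL substitution rule, assuming censored values arrive as "<DL" strings (the data are hypothetical, and per the slide, keeping actual instrument values is preferred when available):

```python
# Hypothetical TP series where censored values are stored as "<0.01" strings
raw = ["0.33", "<0.01", "0.48", "0.55", "0.71", "0.94", "1.18", "2.80", "4.20", "5.20"]

DL = 0.01
censored = [v.startswith("<") for v in raw]
fraction_censored = sum(censored) / len(raw)

# Apply the slide's rule: substitute DL/2 only when <20% of values are censored
if fraction_censored < 0.20:
    values = [DL / 2 if c else float(v) for v, c in zip(raw, censored)]
else:
    raise ValueError("too many censored values; use frequency analysis instead")

print(values[:3])  # [0.33, 0.005, 0.48]
```

Keeping the `censored` flags alongside `values` preserves the information that a substitution was made.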
Real World: Dealing with Missing Values
- BEST: Have sufficient data frequency and use the rest of the data values for analysis
- Substitution, e.g., regression analysis: plot TP concentration vs. stream flow; if there is a good correlation, calculate estimated values for the missing TP concentrations when discharge is known. USE SPARINGLY.
- Aggregation: combine data over time intervals (e.g., weekly averages, annual averages)
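The regression substitution above can be sketched with a simple least-squares fit. The flow/TP pairs are hypothetical; a real application would first confirm a strong correlation and flag the filled value as estimated:

```python
# Hypothetical paired observations: discharge (cfs) and TP (mg/l), last TP missing
flow = [1.0, 2.0, 3.0, 4.0, 5.0]
tp   = [0.10, 0.18, 0.31, 0.42, None]

# Fit TP = intercept + slope * flow using only the complete pairs
pairs = [(q, c) for q, c in zip(flow, tp) if c is not None]
n = len(pairs)
mx = sum(q for q, _ in pairs) / n
my = sum(c for _, c in pairs) / n
slope = sum((q - mx) * (c - my) for q, c in pairs) / sum((q - mx) ** 2 for q, _ in pairs)
intercept = my - slope * mx

# Estimate the missing concentration from its known discharge
estimated = intercept + slope * flow[-1]
print(round(estimated, 3))
```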
Real World: Dealing with Non-Normality
- Data transformation: log(X). The log-normal distribution (i.e., the log-transformed data has a normal distribution) is very common for water quality pollutant concentration data. An attribute of such data is a few high values in the tail.
- Utilize non-parametric statistical analyses. However, this doesn't cure all problems.
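The effect of the log transform can be seen by comparing sample skewness before and after. The concentrations are hypothetical, and the skewness formula is the adjusted Fisher-Pearson form reported by Excel and SAS:

```python
import math
import statistics

def sample_skewness(xs):
    """Adjusted Fisher-Pearson sample skewness (the form Excel and SAS report)."""
    n, m, s = len(xs), statistics.mean(xs), statistics.stdev(xs)
    return (n / ((n - 1) * (n - 2))) * sum(((x - m) / s) ** 3 for x in xs)

# Hypothetical concentrations with a long right tail, typical of lognormal WQ data
conc = [0.33, 0.48, 0.55, 0.71, 0.72, 0.94, 1.18, 2.80, 4.20, 5.20]
logged = [math.log10(x) for x in conc]

# The log transform pulls in the right tail, reducing skewness toward zero
print(round(sample_skewness(conc), 2), round(sample_skewness(logged), 2))
```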
Parametric vs. Nonparametric

Parametric:
- Mean = central tendency
- Symmetrical distribution about the mean (usually normal); lognormal and slightly skewed OK
- Must adjust for: autocorrelation (easy), seasonal differences (easy), variance heterogeneity (doable), hydrology / flow (easy)
- Versatile; excellent for: assessments of variability, step trends, linear trends, ramp trends

Nonparametric:
- Median = central tendency
- Normality not required; skewed and outlier data OK
- Must adjust for: autocorrelation (doable), seasonal differences (easy), variance heterogeneity (difficult), hydrology / flow (2 steps)
- Excellent for: assessments of variability, step trends, monotonic trends
Real World: Dealing with Autocorrelation
- Time series analysis, e.g., PROC AUTOREG in SAS (appropriate for weekly, monthly data)
  - Useful in regression relationships (e.g., time trends, correlation between sites such as paired watersheds or upstream/downstream)
  - Note: spatial autocorrelation analysis methods are also available
- Aggregate data: average into larger time steps (e.g., quarterly, annually). Problem: potential loss of degrees of freedom.
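A quick screen before reaching for a full time-series procedure is the lag-1 autocorrelation coefficient; values well above zero suggest adjacent observations are not independent. The weekly series below is hypothetical:

```python
import statistics

# Hypothetical weekly concentrations; neighboring values tend to move together
x = [3.1, 3.0, 2.9, 2.8, 2.9, 3.2, 3.4, 3.5, 3.3, 3.1, 2.9, 2.8]

# Lag-1 autocorrelation: covariance of the series with itself shifted one step,
# divided by the series variance
m = statistics.mean(x)
num = sum((x[i] - m) * (x[i + 1] - m) for i in range(len(x) - 1))
den = sum((v - m) ** 2 for v in x)
r1 = num / den
print(round(r1, 2))
```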
Real World: Dealing with Seasonality
- Explanatory variable (covariate) adjustment: use measured variables tied to hydrologic / meteorological changes to adjust for seasonal changes, e.g., TP concentration adjusted for stream discharge by including discharge as an X variable in the trend analysis
- Normalize: e.g., adjust the load value to an average storm discharge level to allow comparison across storms
- Model seasonal cycles into the analysis, e.g.:
  - Indicator variables (0 or 1) for each month/season
  - Sinusoidal models
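The two seasonal encodings above can be sketched as regression inputs (the months are hypothetical; one indicator is dropped to avoid collinearity with the regression intercept):

```python
import math

# Hypothetical months of observation (1-12)
months = [1, 4, 7, 10, 1, 4, 7, 10]

# (1) Indicator (0/1) variables, one per quarter, with quarter 0 as the baseline
quarter = [(m - 1) // 3 for m in months]  # 0..3
indicators = [[1 if quarter[i] == q else 0 for q in (1, 2, 3)]
              for i in range(len(months))]

# (2) Sinusoidal model terms with a 12-month period
sin_term = [math.sin(2 * math.pi * m / 12) for m in months]
cos_term = [math.cos(2 * math.pi * m / 12) for m in months]

print(indicators[0], round(sin_term[0], 2))
```

Either set of columns is then included alongside time (and flow) in the trend regression.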
Natural Variability: What's in a MEAN?
- Central tendency: a good summary statistic, but it doesn't tell the full story
- "The Fallacy of the Mean":
  - Doesn't show range or variability
  - Hard to show statistically significant differences between mean values without variance
  - Non-robust to extremes
Natural Variability: Variability is our Friend
- Use it to determine Minimum Detectable Changes (MDC) or differences
- Find the "goods" and the "bads"
- Avoid unrealistic expectations of good or bad conditions
- Recognize that year-to-year variability can be LARGE
Natural Variability: Utilize explanatory variables / covariates to minimize unexplained variability and assist with making correct data interpretations, such as:
- Land use
- Stream flow / discharge / stage height
- Precipitation
- Ground water table depth
- Temperature
- Season
- Upstream conditions
[Figure: Waukegan River, Illinois. IBI ("IBG: I B Guessing") vs. elapsed months (0-125) for pre-treatment, post-treatment, and control sites; IBI values range roughly 15-40.]
Statistical Analysis Toolbox
- No "witch hunts" allowed: pre-planned questions only
- Utilize statistical test(s) that address the questions / objectives (e.g., assessments of central tendency and variability, step change, gradual change)
- Utilize multiple statistical approaches and graphical presentations
Statistical Distribution Assessment
- Box-and-whisker plots
- Mean & variance / standard deviation
- Median & percentile analysis
- Frequency distribution analysis (e.g., percent of data within the 25th, 50th, and 75th percentiles)
- Percent exceedance of a standard
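A sketch of the percentile and exceedance summaries, using hypothetical fecal coliform counts and an illustrative 200 mpn/100 ml standard:

```python
import statistics

# Hypothetical fecal coliform counts (mpn/100 ml)
fc = [120, 60, 350, 900, 150, 80, 4100, 200, 30, 610]

# Median and quartiles summarize the distribution without assuming normality
median = statistics.median(fc)
p25, p50, p75 = statistics.quantiles(fc, n=4)

# Percent of observations exceeding the (illustrative) standard
standard = 200
exceedance = 100 * sum(v > standard for v in fc) / len(fc)

print(median, round(exceedance), "% exceedance")
```

Note the mean of this series is pulled far above the median by the single 4100 count, which is exactly the "fallacy of the mean" from the earlier slide.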
BMP Effectiveness: An Example Across Sites / Studies (e.g., multiple watersheds)
[Figure: Range (lowest, highest) and mean of percent load reduction from conservation tillage across studies. Left panel: changes in sediment load (roughly -300% to +200%). Right panel: changes in total P load (roughly -100% to +120%).]
Correlation Between Variables (e.g., TSS and Turbidity, Long Creek)
[Figure: TSS vs. turbidity at Long Creek, Site E, with observed and predicted values from a fitted regression, shown both in original units and as log(TSS) vs. log(turbidity).]
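The log-log relationship in the figure can be checked numerically with a Pearson correlation on log-transformed values. The TSS/turbidity pairs below are hypothetical, not Long Creek data:

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient r."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

# Hypothetical paired observations; log transforms often linearize the relationship
tss  = [12.0, 35.0, 80.0, 190.0, 560.0]   # mg/l
turb = [8.0, 22.0, 55.0, 130.0, 420.0]    # NTU

r = pearson([math.log10(v) for v in turb], [math.log10(v) for v in tss])
print(round(r, 3))
```

A strong correlation like this is what justifies using turbidity as a surrogate, or as a regression substitute for missing TSS values.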
Statistical Approaches: Comparisons Between Locations
- Parametric:
  - t-test (compare mean values between 2 groups)
  - Analysis of variance, ANOVA (compare more than 2 groups)
  - Analysis of covariance (adds an explanatory variable, which can be a continuous variable such as stream flow)
- Non-parametric:
  - Wilcoxon rank sum (~t-test)
  - Kruskal-Wallis k-sample (~ANOVA)
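As a sketch, the two 2-group comparisons can be run with SciPy (assumed available here; the slides themselves name SAS and Excel). The upstream/downstream TP values are hypothetical:

```python
from scipy import stats

# Hypothetical upstream vs. downstream TP concentrations (mg/l)
upstream   = [0.12, 0.15, 0.11, 0.18, 0.14, 0.16, 0.13, 0.17]
downstream = [0.22, 0.31, 0.27, 0.25, 0.35, 0.24, 0.29, 0.33]

# Parametric: two-sample t-test on the means
t_stat, t_p = stats.ttest_ind(upstream, downstream)

# Non-parametric: Mann-Whitney U, equivalent to the Wilcoxon rank sum test
u_stat, u_p = stats.mannwhitneyu(upstream, downstream, alternative="two-sided")

print(t_p < 0.05, u_p < 0.05)  # both flag a significant difference here
```

With concentration data this skew-prone, it is worth running both and checking that they agree.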
Statistical Approaches: Step Trend (comparison between 2 time periods)
- Parametric tests:
  - t-test (non-paired or paired; paired is usually more powerful)
  - Analysis of variance
  - Analysis of covariance
Statistical Approaches: Step Trend (con't)
- Non-parametric tests:
  - Step trend (non-paired): Wilcoxon rank sum test; seasonal Wilcoxon rank sum test; Kruskal-Wallis k-sample (compares more than 2 groups, ~analysis of variance)
  - Step trend (paired differences): Wilcoxon signed rank test
Statistical Approaches: Continuous Trend
- Parametric tests:
  - Linear regression: add explanatory variables (covariates) where appropriate; can add a dummy variable to mimic a ramp
  - Analysis of covariance (e.g., adjustment for upstream concentration or control watershed)
  - Time series analysis
Statistical Approaches: Continuous Trend (con't)
- Non-parametric tests:
  - Correlation: Spearman's rank correlation (Spearman's rho)
  - Monotonic trends: Kendall's tau (Mann-Kendall, Kendall rank correlation); Seasonal Kendall test
  - Flow adjustment, a 2-step process: 1) calculate residuals from a linear regression of concentration vs. discharge; 2) use the residuals (adjusted values) in one of the above tests
  - Contingency table (e.g., Cochran-Mantel-Haenszel (CMH) statistics)
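As an illustration, the heart of the Mann-Kendall test, the S statistic, takes only a few lines: it counts concordant minus discordant pairs in time order, and a positive S suggests an upward monotonic trend. The annual values below are hypothetical, and a complete test also needs the variance of S for a z-score or exact table lookup:

```python
# Hypothetical annual values in time order
x = [0.31, 0.28, 0.35, 0.40, 0.38, 0.45, 0.47, 0.52]

def sign(v):
    return (v > 0) - (v < 0)

# Mann-Kendall S: for every pair (i, j) with i < j, +1 if the later value is
# higher, -1 if lower, 0 if tied
n = len(x)
S = sum(sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
print(S)  # out of a possible +/- n*(n-1)/2 = 28
```

Because it uses only the signs of differences, the test is insensitive to outliers and needs no normality assumption, which is why it pairs well with the water quality data described earlier.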
Long Creek, NC Section 319 NMP
- 83%, 76%, 78%, and 33% reductions in sediment, TP, TKN, and nitrate-N loads, respectively (upstream/downstream, before/after design)
Long Creek, NC
[Figure: log weekly TSS load (lbs) vs. week (1-161) for the downstream treatment and upstream control stations. See NWQEP NOTES, July 1999, Figure 7 for a SAS program to test for trends downstream after adjusting for upstream.]
Section 319 NMP Projects: Morro Bay, California
- 4-H Watershed Model, Youth Education