A (very) short course on the analysis of Water Quality Data

Transcription

1 A (very) short course on the analysis of Water Quality Data Carl James Schwarz Department of Statistics and Actuarial Science Simon Fraser University Burnaby, BC, Canada stat.sfu.ca 1 / 118

2 Introduction Introduction 2 / 118

3 Introduction Objectives: Review the concepts of mean, standard deviation, standard error, confidence interval and percentile in the context of water quality studies. Show how to compare water quality using synoptic readings at two or more sites. Use some sample datasets provided by MOE to examine how to deal with problematic datasets, with suggestions on how to proceed. 3 / 118

4 Introduction - Loading Analysis Tool Pack Excel 2010 ships with some additional statistical (and other) analysis tools in a tool pack, but this may not be loaded. Check the Data Analysis menu to see if the Analysis Tool Pack shows. 4 / 118

5 Introduction - Loading Analysis Tool Pack Use the File Options Add-Ins. 5 / 118

6 Introduction - Loading Analysis Tool Pack Select and load the Analysis Tool Pack. This is not the best package (very limited functionality). You are better off using a proper package such as JMP/ R/ SAS, etc. 6 / 118

7 Introduction - Recommended Reading Helsel, D. R. and Hirsch, R. M. (online). Statistical Methods in Water Resources twri4a3/ Included as pdf file in package. 7 / 118

8 Introduction - Recommended Reading McBride, G. B. (2005). Using Statistical Methods for Water Quality Management. Wiley, New York 8 / 118

9 Introduction - Recommended Reading Helsel, D. R. (2005). Non-detects and data analysis: Statistics for censored environmental data. Wiley, New York. 9 / 118

10 Introduction - Recommended Reading 10 / 118

11 Introduction - Obtaining Data What is the population of interest? Water quality parameter over the entire year? What is the question of interest? Interested in average water quality? Interested in spread of water quality? Interested in higher percentiles (e.g. the 95th percentile)? Interested in comparing averages across sites? Interested in comparing percentiles across sites? (not part of this course) Interested in the relationship of water quality to other covariates (such as rainfall, temperature)? (not part of this course) 11 / 118

12 Introduction - Obtaining Data The 3 R's of good data collection: Randomization - this is what makes the sample representative of the population. Every member of the population should have a known probability of selection. Often this is equal. Replication - this controls how reproducible your estimates are. Stratification - controls for some explainable sources of variation in your readings. Synoptic readings to compare sites control for time effects. No amount of statistical wizardry can rescue badly collected data! 12 / 118

13 Introduction - Characteristics of Water Quality Data Often has a lower bound of zero and no negative values are possible (usually not a problem). Outliers, typically on the high side (can greatly affect results if not accounted for). Long right tails (skewness). Rather than being symmetric, there is a long tail to the right (try a log-transform). Non-normally distributed data (try a log-transform or resistant methods). Censoring (e.g. below detection limits (BDL)) (not part of this course). Seasonal patterns (e.g. values tend to be higher in the summer than winter). Try stratification or modeling. Autocorrelation (e.g. measurements taken closely together in space and time tend to be related) (not part of this course, but see webinar for Air Quality branch). 13 / 118

14 Introduction - Dealing with outliers No formal definition of an outlier other than a data value that is unusual. Dealing with outliers: Throw away old-fashioned rules such as Grubbs' rule. Check the data and see if there is a typing error etc. Perhaps the population is more complex than previously thought. Try the analysis with the outlier in and the outlier out. If it makes no difference, who cares. 14 / 118

15 Introduction - Missing Data Missing data mechanisms: MCAR (Missing completely at random). The missingness is unrelated to the response or another covariate in a study. For example, the sample vial broke when accidentally dropped. No bias is introduced into results but the standard error is larger than with complete data. MAR (Missing at random). The missingness is unrelated to the response, but may be related to another covariate. For example, rather than having an equal number of samples in each month, fewer samples can be obtained in January because of access issues. If you know the adjustment factor, the collected data can be reweighted (not part of this course). IM (Informative missing). The missingness is related to the response value. For example, the concentration is so large that it exceeds the detection limits of the machine. Censoring is relatively easy to deal with using modern software (not Excel). Other forms of informative missingness are very difficult to deal with (not part of this course). 15 / 118

16 Introduction - Transformations You say toe-mah-toe and I say toe-may-toe. Don't be afraid of transformations, e.g. pH is on a log-scale. The most common transformation in water quality is the log() transform, discussed later. Sometimes it is not obvious what the correct scale is (e.g. mpg or L/100 km?). See Hoaglin, David C., 1988, Transformations in everyday experience: Chance 1. 16 / 118

17 Summary Statistics and Inference Basic Summary Statistics 17 / 118

18 Summary Statistics Objectives: Mean, Standard Deviation, Empirical Intervals Mean, Standard Errors, Confidence Intervals Mean vs. Geometric Mean Median vs. Mean Quantiles and Tolerance Intervals 18 / 118

19 Summary Statistics - Mean, Std Dev, Empirical Interval Sample data set: Arsenic (As) Concentrations from Ground Water (See H & H, Chapter 3, page 71) Available as MOE-workshop.xlsx document. 19 / 118

20 Summary Statistics - Mean Data must be collected using RRRs, otherwise summary statistics are nonsense! Mean ($\bar{Y}$) = simple average of values. Use the average() function in Excel. $\bar{Y}_{As}$ = [value not preserved]. Means are greatly influenced by outliers, so these need to be dealt with prior to analysis. 20 / 118

21 Summary Statistics - Standard Deviation Data must be collected using RRRs, otherwise summary statistics are nonsense! Standard Deviation $s = \sqrt{\frac{\sum_i (Y_i - \bar{Y})^2}{n-1}}$. This is a measure of spread of the INDIVIDUAL values in the sample (and by inference in the population). Use the stdev.s() function in Excel. $s_{As}$ = [value not preserved]. Standard deviations are greatly influenced by outliers, so these need to be dealt with prior to analysis. 21 / 118

22 Summary Statistics - Empirical Rule Data must be collected using RRRs, otherwise summary statistics are nonsense! Empirical Rule: $\bar{Y} \pm 2s$ contains about 95% of INDIVIDUAL data values. Use average() +/- 2*stdev.s() in Excel. The empirical interval for As [bounds not preserved] contains 24/25 = 96% of observations. This gives a rough range for FUTURE observations. Empirical intervals are greatly influenced by outliers, so these need to be dealt with prior to analysis. 22 / 118

23 Summary Statistics - SE & Confidence Interval Data must be collected using RRRs, otherwise summary statistics are nonsense! Every time we take a NEW sample, the sample mean ($\bar{Y}$) will change. How reproducible are our results? The standard error (SE) measures the variability of future SAMPLE MEANS. The standard deviation (SD) measures the variability of future OBSERVATIONS. Under Simple Random Sampling the SE of the sample mean is $s/\sqrt{n}$. Different sampling schemes and different statistics have different SE formulae. Excel does NOT have a built-in function. SE = stdev.s(cellrange)/sqrt(count(cellrange)) = [value not preserved]. SE are greatly influenced by outliers, so these need to be dealt with prior to analysis. 23 / 118

24 Summary Statistics - SE & Confidence Interval Data must be collected using RRRs, otherwise summary statistics are nonsense! A 95% confidence interval is a range of plausible values for the population MEAN; about 95% of intervals built this way from future samples will contain it. A 95% empirical interval is an estimate of the likely values of OBSERVATIONS from future samples. Under Simple Random Sampling, a 95% c.i. is found as $\bar{Y} \pm t_{\alpha/2,\,n-1}\, s/\sqrt{n}$. 95% c.i. bounds = average(cellrange) -/+ confidence.t(.05,stdev.s(cellrange),count(cellrange)). For the As data the 95% c.i. is [bounds not preserved]. CI are greatly influenced by outliers, so these need to be dealt with prior to analysis. 24 / 118
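As a rough cross-check of the Excel formulas on these slides, the same quantities can be sketched in Python. The readings below are made-up illustrative values (not the As data), the function name `summarize` is hypothetical, and the t multiplier is supplied by hand from a t table rather than computed.

```python
import math
import statistics

def summarize(values, t_crit):
    """Return mean, SD, SE, 95% empirical interval and 95% CI.

    t_crit is the two-sided t multiplier for n-1 df, supplied by the
    caller (e.g. from a t table); it is not computed here.
    """
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)        # sample SD, divisor n-1 (like stdev.s)
    se = sd / math.sqrt(n)               # SE of the mean under simple random sampling
    ei = (mean - 2 * sd, mean + 2 * sd)  # empirical rule: INDIVIDUAL values
    ci = (mean - t_crit * se, mean + t_crit * se)  # population MEAN
    return {"n": n, "mean": mean, "sd": sd, "se": se, "ei": ei, "ci": ci}

# Hypothetical readings, for illustration only.
readings = [4.0, 6.0, 5.0, 9.0, 6.0]
result = summarize(readings, t_crit=2.776)  # t(0.025, df = 4) from a t table
print(result)
```

The empirical interval is always wider than the confidence interval because it describes individual readings rather than the mean.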

25 Summary Statistics - Using Analysis ToolPack 25 / 118

26 Summary Statistics - Using Analysis ToolPack 26 / 118

27 Summary Statistics - Using Analysis ToolPack Usual summary statistics. Notice that only the half-width of the 95% CI is provided, so you still must compute the upper and lower bounds yourself. 27 / 118

28 Summary Statistics - Recap Data must be collected using RRRs, otherwise summary statistics are nonsense! The mean measures the average value in the data set. The standard deviation (SD) measures the variability of future OBSERVATIONS. The standard error (SE) measures the variability of future SAMPLE MEANS. A 95% empirical interval (EI) is an estimate of the likely values of OBSERVATIONS from future samples. A 95% confidence interval (CI) is a range of plausible values for the population MEAN; about 95% of intervals built this way from future samples will contain it. The SD does not depend on n. The SE and the CI width decline as a function of $1/\sqrt{n}$. These statistics are greatly influenced by outliers. 28 / 118

29 Summary Statistics - Mean vs. Geometric Mean Means, SD implicitly assume a symmetric distribution so that averages and spread are meaningful. SE, CI are acceptable regardless of distribution in large samples. [How big is large??]. But they perform better with symmetric distributions in small samples. For skewed data (e.g. long right-tail) a logarithmic transform often improves inference. 29 / 118

30 Summary Statistics - Mean vs. Geometric Mean 30 / 118

31 Summary Statistics - Mean vs. Geometric Mean Two common transformations in water quality are log() and ln(). Natural logarithms - ln() - base e, e.g. ln(100) = 4.60 and $e^{4.60}$ = exp(4.60) ≈ 100. Common logarithms - log() - base 10, e.g. log(100) = 2 and $10^2 = 100$. Relationship: $\ln(y) \approx 2.3 \log(y)$. So it doesn't matter which you use, but be clear in documents what is used. CAUTION: Many computer packages (except Excel) use log() to represent the natural logarithmic transformation, and log10() to represent the common logarithmic transformation. 31 / 118

32 Summary Statistics - Mean vs. Geometric Mean Effect of transformation: 32 / 118

33 Summary Statistics - Mean vs. Geometric Mean Arithmetic Mean - $\bar{Y} = \frac{\sum Y_i}{n}$. Geometric Mean - $GM_Y = \sqrt[n]{Y_1 Y_2 \cdots Y_n}$. Can also be found by using a logarithmic transformation: Geometric Mean - $GM_Y = \mathrm{antilog}(\overline{\log(Y)})$. Example: Y = (5, 12, 20). $\bar{Y} = 12.33$. $GM_Y = \sqrt[3]{5 \times 12 \times 20} = 10.63$. $GM_Y = \exp\left(\frac{\ln(5)+\ln(12)+\ln(20)}{3}\right) = \exp(2.363) = 10.63$. $GM_Y \le \bar{Y}$ (this is always true). $GM_Y$ is less sensitive to outliers than $\bar{Y}$ and is often close to the median. 33 / 118
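The worked example on this slide, Y = (5, 12, 20), can be reproduced with a few lines of Python. This is only a sketch to show that the two routes to the geometric mean (n-th root of the product, and anti-log of the mean log) agree; the variable names are illustrative.

```python
import math
import statistics

y = [5, 12, 20]  # example values from the slide

arith_mean = statistics.mean(y)

# Route 1: n-th root of the product of the values.
geo_root = math.prod(y) ** (1 / len(y))

# Route 2: anti-log of the mean of the natural logs.
geo_log = math.exp(statistics.mean([math.log(v) for v in y]))

print(f"arithmetic mean = {arith_mean:.2f}")
print(f"geometric mean  = {geo_root:.2f} (root), {geo_log:.2f} (logs)")
```

Either logarithm base can be used in route 2 as long as the matching anti-log is applied.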

34 Summary Statistics - Mean vs. Geometric Mean You can compute the usual statistics on the log() data... etc. 34 / 118

35 Summary Statistics - Mean vs. Geometric Mean You can compute the usual statistics on the log() data and back transform the mean, EI, and CI bounds. DO NOT BACK TRANSFORM SD or SE. Be careful when interpreting the back-transformed EI and CI bounds. 35 / 118

36 Summary Statistics - Mean vs. Geometric Mean - Recap Don't be afraid to use the log() transformation, e.g. pH. log() vs. ln(): CAUTION. Packages assume log() is the natural logarithm. Suitable for highly skewed data. CAUTION on back transformation of mean and CI bounds - these refer to the geometric mean in the population (or very close to the median). 36 / 118

37 Summary Statistics - Quantiles Quantiles (Percentiles) are measures of LOCATION within the dataset. $Q_p$ is the value of Y with at least 100p% of INDIVIDUAL values $\le Q_p$ and at least 100(1 − p)% of INDIVIDUAL values $\ge Q_p$. Examples: $Q_{.50}$ = 50th percentile = median has at least 50% of INDIVIDUAL values $\le Q_{.50}$ and at least 50% of INDIVIDUAL values $\ge Q_{.50}$. $Q_{.95}$ = 95th percentile has at least 95% of INDIVIDUAL values $\le Q_{.95}$ and at least 5% of INDIVIDUAL values $\ge Q_{.95}$. Do not confuse the 95th percentile with the 95% EI or 95% CI. 37 / 118

38 Summary Statistics - Quantiles Finding the p quantiles. [There are at least 5 different ways(!) to compute percentiles.] Sort the data from smallest to largest. Compute (n + 1)p. If (n + 1)p is an integer, use $Y_{[(n+1)p]}$, i.e. the ((n + 1)p)th observation after sorting. If (n + 1)p is fractional, interpolate between the two bracketing observations. In large samples, all of the methods will give similar values. Example: As data, n = 25. Median = 50th percentile. (n + 1)p = 13. Use the 13th observation = 19. The median of (13, 11, 17) is 13. The median of (13, 11, 17, 15) is (13 + 15)/2 = 14. The median of (13, 11, ...) is still 13. The median of (13, 11, < 5, < 5, 9) is 9 where < 5 indicates BDL. 38 / 118
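The (n + 1)p rule on this slide can be sketched as a small Python function. The name `quantile_n_plus_1` is hypothetical, and clamping to the smallest or largest observation when the position falls outside 1..n is an assumption of this sketch, not something stated on the slide.

```python
def quantile_n_plus_1(data, p):
    """p-th quantile using position (n + 1) * p in the sorted data,
    interpolating linearly between the two bracketing observations."""
    ys = sorted(data)
    n = len(ys)
    pos = (n + 1) * p        # 1-based position in the sorted data
    if pos <= 1:             # assumption: clamp to the minimum
        return ys[0]
    if pos >= n:             # assumption: clamp to the maximum
        return ys[-1]
    lower = int(pos)         # 1-based index of the observation just below
    frac = pos - lower       # fractional part used for interpolation
    return ys[lower - 1] + frac * (ys[lower] - ys[lower - 1])

print(quantile_n_plus_1([13, 11, 17], 0.5))      # odd n: middle value
print(quantile_n_plus_1([13, 11, 17, 15], 0.5))  # even n: average of the middle two
```

Other packages use different positions (and so give slightly different answers in small samples), which is the point made on the slide.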

39 Summary Statistics - Quantiles Quantiles are (theoretically) invariant to the log() transformation, i.e. if you find the quantile on transformed data and then back transform, you get the same value as the quantile on the original data. But, because of interpolation, the values may be slightly different. Excel function percentile.inc(datarange, p) looks at position (n − 1)p + 1 in the sorted data and interpolates if this is not an integer. 39 / 118

40 Summary Statistics - Quantiles Comparison of Excel vs JMP (Different interpolation rules). 40 / 118

41 Summary Statistics - Quantiles Quantile plots: (1) Sort the data (Y) from smallest to largest. Even if data are censored, they usually can be ordered. (2) Number the sorted values from 1 to n. (3) Create the plotting positions as $p_i = (i - 0.4)/(n + 0.2)$ [Cunnane plotting positions (other formulae exist)]. When tied values are present, each is assigned a separate plotting position and the tied values will create a vertical cliff on the plot. (4) Plot the Y variable on the bottom axis and the plotting positions $p_i$ on the vertical axis, and join the points. 41 / 118
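The four steps above can be sketched in Python as follows. The function name `cunnane_positions` and the sample values are illustrative only; the sketch returns the coordinates to plot and leaves the plotting itself to whatever package is at hand.

```python
def cunnane_positions(data):
    """Pair each sorted value with p_i = (i - 0.4) / (n + 0.2), i = 1..n.

    Tied values each receive their own plotting position.
    """
    ys = sorted(data)
    n = len(ys)
    return [(y, (i - 0.4) / (n + 0.2)) for i, y in enumerate(ys, start=1)]

# Hypothetical values with one tie (11 appears twice).
sample = [13, 11, 17, 15, 11]
for value, p in cunnane_positions(sample):
    print(f"{value:>5}  {p:.3f}")
```

Plotting `value` on the horizontal axis against `p` on the vertical axis gives the quantile plot; the two tied 11s appear as a vertical step.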

42 Summary Statistics - Quantiles Refer to [As] data: 42 / 118

43 Summary Statistics - Quantiles Refer to [As] data: 43 / 118

44 Summary Statistics - Quantiles Refer to log([As]) data: 44 / 118

45 Summary Statistics - Quantiles If you are willing to assume a specified distribution (e.g. normal) then you can estimate percentiles based on the sample mean and sample standard deviation: $Q_p = \bar{Y} + zs$, where z is the standard normal quantile for p (tabulated on the slide). 45 / 118
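A minimal sketch of the normal-theory percentile, using the standard library to look up z instead of a table. The function name `normal_percentile` and the readings are hypothetical, and the result is only as good as the normality assumption.

```python
import statistics
from statistics import NormalDist

def normal_percentile(values, p):
    """Estimate Q_p as mean + z_p * s, assuming roughly normal data."""
    z = NormalDist().inv_cdf(p)  # standard normal quantile for p
    return statistics.mean(values) + z * statistics.stdev(values)

# Hypothetical readings, for illustration only.
readings = [4.0, 6.0, 5.0, 9.0, 6.0]
q95 = normal_percentile(readings, 0.95)
print(f"estimated 95th percentile = {q95:.2f}")
```

For skewed data the same function can be applied to the logged values and the result back-transformed.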

46 Summary Statistics - Quantiles For small samples, the sample standard deviation (s) is a biased estimator of σ. The adjusted estimate of the percentile is: $Q_{p,adjusted} = \bar{Y} + z\,\frac{s}{M(df)}$ where M(df) is the mean of a standardized chi-distribution. [See my notes for more details.] 46 / 118

47 Summary Statistics - Quantiles Refer to Excel spreadsheet on As SummaryStatistcs 47 / 118

48 Summary Statistics - Tolerance Intervals Avoid NAKED ESTIMATES, e.g. never report a mean without a SE. Tolerance intervals are confidence intervals for percentiles. Example: "We are 95% confident that no more than 10% of observations exceed xxxxx." This is a one-sided 95% confidence interval for the 90th percentile. 48 / 118

49 Summary Statistics - TI - Large Samples Let p be the quantile of interest. Compute $R_{upper} = np + z\sqrt{np(1-p)}$. The upper bound of the tolerance interval is then $Y_{[R_{upper}]}$. For a one-sided 95% confidence interval use z = 1.645 as seen earlier. Example: with n = 500, find a 95% one-sided tolerance interval for the 99th percentile, i.e. you will be 95% confident that 1% or less of future observations will exceed this value. The estimated quantile is the 500(.99) = 495th value. With z = 1.645: $R_{upper} = 500(.99) + 1.645\sqrt{500(.99)(1-.99)} = 495 + 3.66 = 498.66$. The tolerance upper bound is then $Y_{[498.66]}$. Some interpolation may be required to find $Y_{[498.66]}$. 49 / 118
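The rank calculation on this slide can be sketched in Python. The function names are hypothetical, and returning None when the rank falls outside the data is a choice made for this sketch; the formula itself is the large-sample normal approximation given on the slide.

```python
import math

def upper_rank(n, p, z=1.645):
    """Large-sample rank R_upper = n*p + z*sqrt(n*p*(1-p))."""
    return n * p + z * math.sqrt(n * p * (1 - p))

def upper_tolerance_bound(data, p, z=1.645):
    """Interpolate the sorted data at rank R_upper.

    Returns None when the rank falls outside 1..n, i.e. when the bound
    would need extrapolation beyond the observed data.
    """
    ys = sorted(data)
    n = len(ys)
    r = upper_rank(n, p, z)
    if r < 1 or r > n:
        return None
    lower = int(r)             # 1-based rank just below r
    upper = min(lower, n - 1)  # 0-based index of the next observation
    frac = r - lower
    return ys[lower - 1] + frac * (ys[upper] - ys[lower - 1])

print(upper_rank(500, 0.99))  # rank for the n = 500 example
print(upper_rank(10, 0.99))   # exceeds n = 10, so no bound is available
```

With small n the rank exceeds the sample size, which is why the non-parametric approach needs large samples.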

50 Summary Statistics - TI - Large Samples CAUTION: The calculation may require extrapolation outside the range of the dataset (in which case the bound cannot be computed). Example: if n = 10 then $R_{upper} = 10(.99) + 1.645\sqrt{10(.99)(1-.99)} = 9.9 + 0.52 = 10.42$, i.e. beyond the 10th (largest) observation, which is out of the range of the dataset (!). [Also, for small samples replace the normal distribution with a binomial distribution as shown in Conover (1999).] 50 / 118

51 Summary Statistics - TI - Large Samples Refer to Excel spreadsheet on As SummaryStatistcs 51 / 118

52 Summary Statistics - TI - Observed Min/Max You will be $(1 - p^n) \times 100\%$ confident that at least the fraction p of future observations will lie below the maximum of the observed data. For the As dataset, n = 25. You are $(1 - 0.95^{25}) \times 100\% = 72\%$ confident that at least 95% of future observations will lie below the largest value, or 72% confident that no more than 5% of observations will exceed the maximum. You are $(1 - 0.99^{25}) \times 100\% = 22\%$ confident that at least 99% of future observations will lie below the largest value, or only 22% confident that no more than 1% of observations will exceed the maximum. If n = 5 (not much useful information available): You are $(1 - 0.95^{5}) \times 100\% = 23\%$ confident that at least 95% of future observations will lie below the largest value, or only 23% confident that no more than 5% of observations will exceed the maximum. 52 / 118
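The three confidence levels quoted on this slide follow directly from 1 − p^n, as this short Python sketch shows (the function name is illustrative).

```python
def confidence_max_covers(n, p):
    """Confidence (as a fraction) that at least a fraction p of future
    observations lie below the largest of n observed values: 1 - p**n."""
    return 1 - p ** n

# The three cases from the slide: prints 72%, 22% and 23%.
for n, p in [(25, 0.95), (25, 0.99), (5, 0.95)]:
    print(f"n={n}, p={p}: {100 * confidence_max_covers(n, p):.0f}%")
```

Turning the formula around, about 59 observations are needed before the maximum gives 95% confidence for the 95th percentile, since 0.95^59 is just under 0.05.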

53 Summary Statistics - TI - Observed Min/Max Refer back to [As] spreadsheet 53 / 118

54 Summary Statistics - TI - Using normal distribution The previous slides show that very little information is available in small samples if you are not willing to assume a distribution for the data. If the assumption of normality is sensible, you can do better. Basic form is: $TI_p = \bar{Y} + ks$ where k is obtained from tables. See prc263.htm; Gerow, K. and Bielen, Confidence Intervals for Percentiles: An Application to Estimation of Potential Maximum Biomass of Trout in Wyoming Streams. North American Journal of Fisheries Management 19, CIFPAA>2.0.CO;2; and the Excel spreadsheet. CAUTION: TI are very sensitive to the normality assumption! 54 / 118

55 Summary Statistics - TI - Using normal distribution Refer to the [As] workbook. 55 / 118

56 Summary Statistics - Quantile and TI - Recap Quantiles measure LOCATION of INDIVIDUAL observations (e.g. median or 95th percentile). Tolerance Intervals (TI) are confidence intervals on percentiles, e.g. you are 90% sure that no more than 5% of INDIVIDUAL values exceed xxxxx. Quantiles can be estimated: Using non-parametric methods. Find $Q_p = Y_{[np]}$. [The (np) varies among methods.] Assuming a distribution (e.g. normal): $Q_p = \bar{Y} + zs$. Tolerance intervals can be estimated: Using non-parametric methods. These require large samples. Based on min/max of observed data. Frightening how little knowledge is available in small samples. Based on the normal distribution. The assumption of normality is crucial for extreme quantiles (!) 56 / 118

57 Summary Statistics - Final Summary Data must be collected using RRRs; otherwise summary statistics have no meaning. Are you interested in the AVERAGE? Estimate mean, SE, and CI. CI say NOTHING about INDIVIDUAL observations. Are you interested in the SPREAD? Estimate mean, SD, and EI. Two-sided tolerance intervals (not covered in this webinar). Are you interested in EXTREMES (e.g. higher order quantiles)? Estimate quantiles and TI. Non-parametric (with large samples), or parametric (with small samples), but the latter is EXTREMELY sensitive to the assumption of normality. 57 / 118

58 Assessing Normality Assessing Normality - Normal Quantile Plots 58 / 118

59 Assessing Normality Objectives: Constructing a normal quantile plot. Assessing normality based on quantile plot. Detecting outliers and skewness based on quantile plot 59 / 118

60 Assessing Normality - Quantile Plots Quantile-plots are a graphical method to assess distributional assumptions: Easier to assess straight-line fit rather than fitting to a curve It is not necessary to create arbitrary bins as is needed for histograms. All of the data are displayed unlike box-plots. Every point in the data can be displayed without overlap. 60 / 118

61 Assessing Normality - Normal Quantile Plot Construction: (1) Sort the data (Y) from smallest to largest. (2) Number the sorted values from 1 to n. No adjustment is made for tied values. (3) Create the plotting positions as $p_i = (i - 0.4)/(n + 0.2)$. (4) Compute the normal quantile ($Q_p$) using the NORMSINV($p_i$) or the NORM.S.INV($p_i$) function in Excel. (5) Plot $Q_p$ (on the X-axis) vs Y (on the Y-axis). If the distribution is correct for the data, the points should lie on an approximate straight line. The slope of the line estimates s (the sample standard deviation); the value of the line at X = 0 estimates the mean. 61 / 118
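The construction steps above can be sketched in Python, with the standard library's `NormalDist().inv_cdf` playing the role of the Excel inverse-normal function. The function name `normal_quantile_points` and the sample values are illustrative; the sketch only produces the plot coordinates.

```python
from statistics import NormalDist

def normal_quantile_points(data):
    """Return (normal quantile, observed value) pairs for a normal quantile plot."""
    ys = sorted(data)
    n = len(ys)
    std_normal = NormalDist()      # mean 0, SD 1
    points = []
    for i, y in enumerate(ys, start=1):
        p = (i - 0.4) / (n + 0.2)  # Cunnane plotting position
        points.append((std_normal.inv_cdf(p), y))
    return points

# Hypothetical values, for illustration only.
for q, y in normal_quantile_points([13, 11, 17, 15, 11]):
    print(f"{q:6.2f}  {y}")
```

Plotting the first column on the X-axis against the second on the Y-axis gives the normal quantile plot; applying the function to logged values gives the plot for the log scale.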

62 Assessing Normality - Quantile Plots Quantile plot of the original data. 62 / 118

63 Assessing Normality - Quantile Plots Quantile plot of the log(As) data. 63 / 118

64 Assessing Normality - Quantile Plots 64 / 118

68 Assessing Normality - Summary Q-Q plots are an easy way to assess normality. Look for departures from linearity (but don't over-interpret the plots). For inference about the MEAN, it is not too crucial that the underlying distribution is normal except in (very) small sample sizes. For inference about extreme percentiles, it is crucial that normality assumptions be satisfied. 68 / 118

69 Linear Models Comparing Water Quality Readings 69 / 118

70 Comparing Water Quality Readings Objectives: How does WQ compare between 2+ sites Types of designs (paired vs. unpaired) Case study - French Creek 70 / 118

71 Comparing Water Quality Readings Types of designs. Paired/blocked: Paired (2 sites) or Blocked (2+ sites) designs are similar. Synoptic (same day) or near synoptic (same week?) readings are taken. Interested in the average difference. Not necessary to have a random sample of times. It may be preferable to select times to enhance contrast (e.g. sample at low and peak flows). Analysis: Paired t-test; Single Factor Randomized Block ANOVA. Independent samples (not part of this course): 2+ sites to be compared. Separate, random samples taken from each site. Interested in comparing the MEANS across the sites. Analysis: Independent sample t-test; Single Factor Completely Randomized Design ANOVA (a.k.a. One-way ANOVA). You MUST match the analysis to the design! 71 / 118

72 Comparing Water Quality Readings - Case Study French Creek - data set available. Five locations along French Creek. Monthly + two sets of 5-in-30 samples starting late-July and late-October. (Near) Synoptic data. How do the readings compare across the sites? 72 / 118

73 Comparing Water Quality Readings - 2 Paired Sites Compare readings in 2 sites. Align the data by date. Find the difference in readings or the log(ratio) of readings. Use the difference if the range of values is small so that the differences between sites are relatively similar (e.g. a consistent difference of about 2 units). Use the log(ratio) if the range of values is large so that differences between sites are NOT relatively similar (e.g. range from 2 to 200 units), but the RATIO of readings is (e.g. one site's readings are about twice the other site's). Compute the mean difference, the SE of the mean difference, and a 95% confidence interval for the mean difference, and see if the 95% confidence interval includes 0. If you want a p-value, test the hypothesis that the mean difference in the population is zero. 73 / 118

74 Comparing Water Quality Readings - 2 Paired Sites [Table: Turbidity readings by sampling week at Barclay Bridge, Coombs, Grafton Road, New Highway and Winchester; values not preserved.] 74 / 118

75 Comparing Water Quality Readings - 2 Paired Sites Start with plots over time for each site: 75 / 118

76 Comparing Water Quality Readings - 2 Paired Sites Compute the difference or log(ratio) between two sites (e.g. Barclay vs. Coombs). Drop cases where missing values are present. 76 / 118

77 Comparing Water Quality Readings - 2 Paired Sites 77 / 118

78 Comparing Water Quality Readings - 2 Paired Sites Compute using Excel functions: Number of differences/log-ratios - count(); Mean difference/log-ratio - average(); Std dev of diff/log-ratio - stdev.s(); 95% CI half width - confidence.t(); 95% CI = mean ± 95% CI half width; t-test for testing equality of means - t.test(). Also take anti-logs of the mean and 95% CI for the log-ratio. 78 / 118
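The paired calculation listed above can be sketched in Python. The site readings below are made up for illustration (they are not the French Creek data), the function name `paired_summary` is hypothetical, and the t multiplier is taken from a t table by hand. Pairs with a missing reading (None) are dropped, as recommended for this analysis.

```python
import math
import statistics

def paired_summary(site_a, site_b, t_crit, use_log_ratio=False):
    """Mean paired difference (or ratio) with a 95% CI.

    None marks a missing reading; any pair with a missing value is dropped.
    t_crit is the two-sided t multiplier for (pairs - 1) df, from a t table.
    """
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    if use_log_ratio:
        d = [math.log(a / b) for a, b in pairs]
    else:
        d = [a - b for a, b in pairs]
    n = len(d)
    mean = statistics.mean(d)
    se = statistics.stdev(d) / math.sqrt(n)
    lower, upper = mean - t_crit * se, mean + t_crit * se
    if use_log_ratio:
        # Back-transform the mean and CI bounds (never the SE) to ratios.
        mean, lower, upper = math.exp(mean), math.exp(lower), math.exp(upper)
    return {"n": n, "estimate": mean, "ci": (lower, upper)}

# Hypothetical synoptic readings; the fourth week is missing at site A.
site_a = [2.0, 4.0, 3.0, None, 8.0]
site_b = [1.0, 2.0, 2.0, 1.5, 4.0]
diff = paired_summary(site_a, site_b, t_crit=3.182)  # t(0.025, df = 3)
ratio = paired_summary(site_a, site_b, t_crit=3.182, use_log_ratio=True)
print(diff)
print(ratio)
```

For differences, check whether the interval includes 0; for ratios, check whether it includes 1.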

79 Comparing Water Quality Readings - 2 Paired Sites 79 / 118

80 Comparing Water Quality Readings - 2 Paired Sites Barclay averages 0.947 NTU higher than Coombs, with a 95% c.i. of between 0.45 and 1.44 NTU higher than Coombs. Barclay is, on average, approx 1.98x (95% CI 1.70x to 2.30x) larger than Coombs. CAUTION: 95% c.i. say nothing about individual differences or individual ratios, i.e. it is NOT CORRECT TO SAY that 95% of differences lie between 0.45 and 1.44 NTU. 80 / 118

81 Comparing Water Quality Readings - 2 Paired Sites CAUTIONS: Excel does not deal with missing data very nicely. Look what happens if last reading for Coombs is missing. 81 / 118

82 Comparing Water Quality Readings - 2 Paired Sites Similar output from JMP with additional graphs to show that log(ratio) is likely better choice than difference: 82 / 118

83 Comparing Water Quality Readings - 2 Paired Sites JMP also handles missing values automatically: 83 / 118

84 Comparing Water Quality Readings - 2 Paired Sites A formal p-value is unnecessary but can be obtained as well. Because the p-values are very small, there is strong evidence of a difference (on average) between the two sites. 84 / 118

85 Comparing Water Quality Readings - 2 Paired Sites Summary Pairing is induced by synoptic readings. Delete any pairs with missing values. More advanced software (e.g. R/SAS/JMP) can also incorporate missing data but this is beyond the scope of the course. CAUTION: EXCEL does NOT handle missing data well. Compute differences and/or log(ratio). Use differences if readings are similar over time. Use log(ratio) if there is large variation in readings and the RATIO is consistent over time. Compute the mean difference or log-ratio and a 95% confidence interval for the population mean difference or log-ratio. Is 0 included in the 95% confidence interval? If so, then there is no evidence of a difference (on average). CAUTION: 95% confidence intervals say NOTHING about individual differences or log-ratios. Use the anti-log on the mean and 95% c.i. to convert log-ratios back to ratios. 85 / 118

86 Comparing Water Quality Readings - 3+ Paired Sites It is possible to extend the analysis to 3 or more sites with synoptic readings. Make an array of week by site. Record actual values or ln(values). CAUTION: EXCEL does not ALLOW for ANY missing data. You must exclude an entire sampling week if any data are missing. This can lead to a LARGE loss of data. JMP/SAS/R can gracefully deal with missing data. 86 / 118

87 Comparing Water Quality Readings - 3+ Paired Sites Notice that readings are dropped for ALL sites on a sampling date with missing data at New Highway. 87 / 118

88 Comparing Water Quality Readings - 3+ Paired Sites Known as a Randomized Block Design. Blocks = Synoptic Times = device for pairing up observations that are affected in a similar way. Assume that differences among sites are relatively consistent across the blocks, i.e. NO INTERACTION between blocks and sites. (OR) Assume that the ratio among sites is relatively consistent across the blocks so that differences of log(values) are relatively consistent (e.g. site A might always be about 2x larger than site B). Again start with a plot of values as seen earlier. 88 / 118

89 Comparing Water Quality Readings - 3+ Paired Sites Start with plots over time for each site: 89 / 118

90 Comparing Water Quality Readings - 3+ Paired Sites 90 / 118

91 Comparing Water Quality Readings - 3+ Paired Sites 91 / 118

92 Comparing Water Quality Readings - 3+ Paired Sites Here, column effect = effects of SITES P-value = 0.02, so there is some evidence of a consistent difference in the MEANS among sites Does NOT indicate which sites could have the same or different means. Need to follow-up with a Tukey Multiple Comparison Procedure to identify which pairs of sites could have different means. 92 / 118

93 Comparing Water Quality Readings - 3+ Paired Sites Look at estimates of MARGINAL means. The critical range is $CR = Q_a \sqrt{\frac{MSE}{\#\,blocks}}$ where $Q_a$ is the value from the Studentized range with $df_1$ = #sites and $df_2$ = df MSE. In this case we look for (5, 52) df (see previous slide) at duke.edu/courses/spring98/sta110c/qtable.html and find $Q_a = 4.20$, so that $CR = 4.20\sqrt{0.65/\#\,blocks}$. Then two means could be different if $|\bar{Y}_1 - \bar{Y}_2| > CR$. 93 / 118

94 Comparing Water Quality Readings - 3+ Paired Sites This is VERY tedious and error prone in EXCEL; use a proper package such as JMP/ R/ SAS etc. The output is automatic and more informative, and it can handle missing data. 94 / 118
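The critical-range calculation can be sketched in Python for a complete week-by-site table. The function names, the small table and the choice of three sites are hypothetical; the Studentized range value is read from a table by the user, not computed, and the sketch assumes no missing cells.

```python
import math

def rcb_mse(table):
    """Mean square error for a complete blocks-by-sites table.

    table[i][j] is the reading for block (sampling week) i at site j.
    Returns (MSE, error df, site means).
    """
    b = len(table)     # number of blocks
    k = len(table[0])  # number of sites
    grand = sum(sum(row) for row in table) / (b * k)
    block_means = [sum(row) / k for row in table]
    site_means = [sum(row[j] for row in table) / b for j in range(k)]
    ss_error = sum(
        (table[i][j] - block_means[i] - site_means[j] + grand) ** 2
        for i in range(b) for j in range(k)
    )
    df_error = (b - 1) * (k - 1)
    return ss_error / df_error, df_error, site_means

def critical_range(q, mse, n_blocks):
    """Tukey critical range CR = Q * sqrt(MSE / number of blocks)."""
    return q * math.sqrt(mse / n_blocks)

# Hypothetical readings: 3 sampling weeks (rows) by 3 sites (columns).
table = [[1.0, 2.5, 3.0],
         [2.0, 3.0, 4.5],
         [4.0, 5.0, 6.0]]
mse, df_error, site_means = rcb_mse(table)
q = 5.04  # Studentized range value for 3 sites and 4 error df, read from a table
cr = critical_range(q, mse, n_blocks=len(table))
print(f"MSE = {mse:.3f} on {df_error} df, CR = {cr:.2f}")
print("site means:", site_means)
```

Any two site means that differ by more than `cr` could be different; a statistical package does the same comparison automatically and also copes with missing cells.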

95 Comparing Water Quality Readings - 3+ Paired Sites P-value is small, so there is some evidence of a difference in means among the sites. It does not indicate where the difference may lie. 95 / 118

96 Comparing Water Quality Readings - 3+ Paired Sites This indicates which Sites could have the same mean. Think of paint-chips to understand overlap in ranges of sites that could be the same. 96 / 118

97 Comparing Water Quality Readings - 3+ Paired Sites Provides estimates of effects and confidence intervals for each pairwise difference. 97 / 118

98 Comparing Water Quality Readings - Summary Typically used for synoptic data to see if sites are comparable. Exactly 2 sites: Find the difference or log(ratio) of readings from both sites. Drop any sampling dates with missing data. Find the mean and 95% confidence interval for the difference in MEAN. See if 0 is included in the confidence interval. Find the p-value using a Paired t-test. 98 / 118

99 Comparing Water Quality Readings - Summary 3+ Sites: Analyze either raw data or log(data). Use Randomized Block Design analysis (Excel: Two-factor with no replication). Look in the ANOVA table at whichever of the Rows/Columns effects corresponds to SITES. Program the Tukey Multiple Comparison procedure by hand (groan)! CAUTION: EXCEL does not deal with missing values correctly - GIVES WRONG ANSWERS. CAUTION: EXCEL is very clumsy in finding where the differences lie. You must program the Tukey procedure. You will make mistakes in doing this! CAUTION: EXCEL does not provide other output to check assumptions of the models. Avoid EXCEL; use a proper package such as JMP/ R/ SAS! 99 / 118

100 Case Study - Barclay Bridge Turbidity Case Study - French Creek - Barclay Bridge - Turbidity 100 / 118

101 Case Study - Barclay Bridge Turbidity Monthly sampling + 5-in-30 samples 101 / 118

102 Case Study - Barclay Bridge Turbidity What are the objectives? Find the distribution of turbidity across the year? Estimate the 95th percentile across the year? Estimate the mean and percentiles only in November? Problems: Non-random sampling across the year with some days having a higher probability of sampling than other days. More samples deliberately selected in August (when the turbidity is low) and more samples deliberately taken in November when turbidity is high. High autocorrelation in values taken close together in the 5-in-30 samples. Standard errors will be understated, i.e. you will think you are more precise than you are. Not clear how to interpret means and percentiles for part of a year. The 95th percentile based on the 5-in-30 samples will NOT estimate the 95th percentile for the year. 102 / 118

103 Case Study - Barclay Bridge Turbidity Example of bias in computing percentiles. Simulated data to follow data based on previous curve with some random noise around the curve. Computed 95 th percentile based on yearly data and on sample dates data. 103 / 118

104 Case Study - Barclay Bridge Turbidity Example of bias in computing means. Simulated data to follow data based on previous curve with some random noise around the curve. Computed mean based on yearly data and on sample dates data. 104 / 118

105 Case Study - Barclay Bridge Turbidity Not clear what to do with this type of data without some consideration of filling in some of the missing data. 105 / 118

106 Case Study - Mercantile Readings are taken monthly, except that 5-in-30 days samples are again taken in August and November. There are many censored readings (indicated by the < character in the adjacent column). There are duplicate and split samples for QA/QC work. There is seasonality in some of the characteristics. 106 / 118

107 Case Study - Mercantile Duplicate and split-sample measurements. These are NOT independent observations and the usual way to deal with these is to take the average of the duplicate or split-sample measurements. Outliers. The sole outliers for NO3 and Turbidity are synoptic. What happened? For some variables (e.g. NO3), weak seasonality is present so some pooling over all months in a year is a possibility. I am still worried about the 5-in-30 readings, as serial correlation may be a problem. Some variables are highly censored while others are lightly censored. There is no easy way to deal with censoring in Excel other than to either ignore it (i.e. treat the limit as the data value) or use 1/2 of the detection limit as the data value. 107 / 118
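The two simple data-preparation steps mentioned on this slide (averaging duplicate or split samples, and substituting a fraction of the detection limit for censored readings) can be sketched in Python. The function names, dates and values are hypothetical, and substitution is shown only because the slide describes it as the fallback when proper censored-data methods are unavailable.

```python
from collections import defaultdict
import statistics

def average_duplicates(records):
    """Collapse duplicate/split samples to one mean value per sampling date.

    records is a list of (date, value) pairs.
    """
    by_date = defaultdict(list)
    for date, value in records:
        by_date[date].append(value)
    return {date: statistics.mean(vals) for date, vals in sorted(by_date.items())}

def substitute_censored(value, censored, fraction=0.5):
    """Replace a '<DL' reading by fraction * DL (0.5 = half the detection limit)."""
    return value * fraction if censored else value

# Hypothetical records: (date, reported value, censored flag).
raw = [("2002-11-05", 2.0, False),
       ("2002-11-05", 4.0, False),  # duplicate sample on the same date
       ("2002-12-03", 5.0, True)]   # reported as < 5
cleaned = [(d, substitute_censored(v, c)) for d, v, c in raw]
print(average_duplicates(cleaned))
```

Setting `fraction=1.0` corresponds to the other option on the slide, treating the limit itself as the data value.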

108 Case Study - Mercantile Turbidity and NO3 in Mercantile Creek: 108 / 118

109 Case Study - Mercantile Several interesting features. There appears to be seasonality with high readings in October and April, but they are not consistent over time (e.g. look what happened in 2003). Sampling intensity is not uniform over the years, with some months missed and no apparent 5-in-30 sampling occasions. A very large value occurred in November 2002 with very small values in the months on either side of it, indicating a very volatile system. As in the French Creek dataset, it is not clear what the 95th percentile is supposed to be measuring. Just the peak events? Across the entire year? There appears to be an increasing trend in both the mean and variability. Fitting a trend line to this data set is beyond the capabilities of Excel because of the problems above. 109 / 118

110 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers in This was done in R (don't use Excel - it gives WRONG results for many regressions!) More details on fitting linear models are available in my course notes. [Coefficient table: Estimate, Std. Error, t value, Pr(>|t|) for (Intercept) and Date; values not recoverable] Standard errors may be too small because of the autocorrelation in the residuals. Strong evidence of an increase over time. 110 / 118
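
The kind of fit described above can be sketched in a few lines. This is not the R analysis from the slide: it is a plain least-squares fit written out by hand, with invented dates (decimal years) and turbidity values, plus a lag-1 autocorrelation of the residuals as a rough check on the standard-error caveat. The function names are hypothetical.

```python
import math

def linear_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x.
    Returns intercept, slope, SE of the slope, and the residuals."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(r * r for r in resid) / (n - 2)   # residual variance
    return intercept, slope, math.sqrt(s2 / sxx), resid

def lag1_autocorrelation(resid):
    """Correlation between successive residuals (they already average zero)."""
    return sum(a * b for a, b in zip(resid, resid[1:])) / sum(r * r for r in resid)

# Invented readings in date order: decimal year and turbidity.
date = [2000.1, 2000.6, 2001.1, 2001.6, 2002.1, 2002.6, 2003.1, 2003.6]
turb = [0.8, 1.1, 0.9, 1.4, 1.3, 1.9, 1.6, 2.2]

b0, b1, se_b1, resid = linear_fit(date, turb)
print(f"slope = {b1:.3f} per year, SE = {se_b1:.3f}, t = {b1 / se_b1:.2f}")
print(f"lag-1 autocorrelation of residuals = {lag1_autocorrelation(resid):.2f}")
```

A lag-1 autocorrelation well away from zero is a sign that the reported standard error should not be taken at face value.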

111 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers in 111 / 118

112 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers in The residual plot (lower left) shows an increase in variance with the mean. 112 / 118

113 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers in This was done in R (don't use Excel - it gives WRONG results for many regressions!) More details on fitting linear models are available in my course notes. [Coefficient table: Estimate, Std. Error, t value, Pr(>|t|) for (Intercept) and Date; values not recoverable] Standard errors may be too small because of the autocorrelation in the residuals. No evidence of an increase over time. 113 / 118

114 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers in 114 / 118

115 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers in The residual plots now all look OK. 115 / 118

116 Summary RRRs No easy way to compute means and percentiles for non-randomly selected data (e.g. monthly + 5-in-30 data). More flexibility when comparing across sites or trends over time. Some caution needed if readings are too close together (e.g. hourly or daily). See my webinar for Air Quality. What are you interested in? Averages - estimate means, SE, and CI. Spread - estimate means, SD, and EI. Extremes - estimate percentiles and TI (either parametric or non-parametric). CAUTION. These results are extremely sensitive to violations of the RRRs and distributional assumptions. Assess normality using Normal Quantile plots. Trends - use linear models (see my webinar for Air Quality). No amount of statistical wizardry can rescue poorly collected data! You will be severely constrained in your analyses if you only use Excel! 116 / 118

117 Summary Missing values? If missingness is MCAR, then no problems. MAR and IM are more difficult and very difficult, respectively, to deal with. Outliers? Run the analysis with the outliers in and with the outliers out. If there is no difference, then who cares? 117 / 118
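
The "outliers in, outliers out" advice amounts to running the same summary twice and comparing. A minimal sketch, assuming the suspect readings have already been identified by position; the data values and the function name `with_and_without` are invented for illustration.

```python
from statistics import mean, stdev

def with_and_without(values, outlier_positions):
    """Return (mean, SD) for all readings and for readings with the
    flagged positions dropped."""
    kept = [v for i, v in enumerate(values) if i not in outlier_positions]
    return {
        "outliers in": (mean(values), stdev(values)),
        "outliers out": (mean(kept), stdev(kept)),
    }

# Invented turbidity readings; the reading at position 4 is the suspect one.
readings = [1.2, 0.9, 1.5, 1.1, 14.0, 1.3, 1.0]

for label, (m, s) in with_and_without(readings, {4}).items():
    print(f"{label:13s} mean = {m:.2f}, SD = {s:.2f}")
```

If the two rows lead to the same conclusion, the outlier does not matter for that analysis; if they differ, the outlier needs to be investigated rather than silently dropped.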

118 Summary Other possible analyses not part of this course: Quantile regression, where you model the change in percentiles, e.g. has the 90th percentile changed over time? Testing for changes in the standard deviation over time rather than the mean to see if variability has changed over time. Modeling the number of events (e.g. days exceeding WQ guidelines in a month) and whether they change over time. For further help, contact Carl Schwarz (stat.sfu.ca). 118 / 118