A (very) short course on the analysis of Water Quality Data

1 A (very) short course on the analysis of Water Quality Data Carl James Schwarz Department of Statistics and Actuarial Science Simon Fraser University Burnaby, BC, Canada stat.sfu.ca 1 / 118

2 Introduction Introduction 2 / 118

3 Introduction Objectives: Review the concepts of means, standard deviations, standard errors, confidence intervals and percentiles in terms of water quality studies. Show how to compare water quality using synoptic readings at two or more sites. Use some sample datasets provided by MOE to examine how to deal with problematic datasets, with suggestions on how to proceed. 3 / 118

4 Introduction - Loading Analysis Tool Pack Excel 2010 ships with some additional statistical (and other) analyses in a Tool Pack, but this may not be loaded. Check the Data Analysis menu to see if the Analysis Tool Pack shows. 4 / 118

5 Introduction - Loading Analysis Tool Pack Use File > Options > Add-Ins. 5 / 118

6 Introduction - Loading Analysis Tool Pack Select and load the Analysis Tool Pack. This is not the best package (very limited functionality). You are better off using a proper package such as JMP, R, SAS, etc. 6 / 118

7 Introduction - Recommended Reading Helsel, D. R. and Hirsch, R. M. (online). Statistical Methods in Water Resources twri4a3/ Included as pdf file in package. 7 / 118

8 Introduction - Recommended Reading McBride, G. B. (2005). Using Statistical Methods for Water Quality Management. Wiley, New York 8 / 118

9 Introduction - Recommended Reading Helsel, D. R. (2005). Non-detects and data analysis: Statistics for censored environmental data. Wiley, New York. 9 / 118

10 Introduction - Recommended Reading 10 / 118

11 Introduction - Obtaining Data What is the population of interest? Water quality parameter over the entire year? What is the question of interest? Interested in average water quality? Interested in spread of water quality? Interested in higher percentiles (e.g. the 95th percentile)? Interested in comparing averages across sites? Interested in comparing percentiles across sites? (not part of this course) Interested in the relationship of water quality to other covariates (such as rainfall, temperature)? (not part of this course) 11 / 118

12 Introduction - Obtaining Data The 3 R's of good data collection. Randomization - this is what makes the sample representative of the population. Every member of the population should have a known probability of selection. Often this is equal. Replication - determines how reproducible your estimates are. Stratification - controls for some explainable sources of variation in your readings. Synoptic readings to compare sites control for time effects. No amount of statistical wizardry can rescue badly collected data! 12 / 118

13 Introduction - Characteristics of Water Quality Data Often has a lower bound of zero and no negative values are possible (usually not a problem). Outliers, typically on the high side (can greatly affect results if not accounted for). Long right tails (skewness). Rather than being symmetric, there is a long tail to the right (try a log-transform). Non-normally distributed data (try a log-transform or resistant methods). Censoring, e.g. below detection limits (BDL) (not part of this course). Seasonal patterns (e.g. values tend to be higher in the summer than winter). Try stratification or modeling. Autocorrelation (e.g. measurements taken closely together in space and time tend to be related). (not part of this course, but see webinar for Air Quality branch) 13 / 118

14 Introduction - Dealing with outliers No formal definition of an outlier other than a data value that is unusual. Dealing with outliers: Throw away old-fashioned rules such as Grubbs' rule. Check the data and see if there is a typing error etc. Perhaps the population is more complex than previously thought. Try the analysis with the outlier in and the outlier out. If it makes no difference, who cares. 14 / 118

15 Introduction - Missing Data Missing data mechanisms: MCAR (Missing completely at random). The missingness is unrelated to the response or another covariate in a study. For example, the sample vial broke when accidentally dropped. No bias is introduced into results but the standard error is larger than with complete data. MAR (Missing at random). The missingness is unrelated to the response, but may be related to another covariate. For example, rather than having an equal number of samples in each month, fewer samples can be obtained in January because of access issues. If you know the adjustment factor, the collected data can be reweighted (not part of this course). IM (Informative missing). The missingness is related to the response value. For example, the concentration is so large that it exceeds the detection limits of the machine. Censoring is relatively easy to deal with using modern software (not Excel). Other forms of informative missingness are very difficult to deal with (not part of this course). 15 / 118

16 Introduction - Transformations You say toe-mah-toe and I say toe-may-toe. Don't be afraid of transformations, e.g. pH is on a log-scale. The most common transformation in water quality is the log() transform - discussed later. Sometimes it is not obvious what the correct scale is (e.g. mpg or L/100 km?). See Hoaglin, David C., 1988, Transformations in everyday experience: Chance 1. 16 / 118

17 Summary Statistics and Inference Basic Summary Statistics 17 / 118

18 Summary Statistics Objectives: Mean, Standard Deviation, Empirical Intervals Mean, Standard Errors, Confidence Intervals Mean vs. Geometric Mean Median vs. Mean Quantiles and Tolerance Intervals 18 / 118

19 Summary Statistics - Mean, Std Dev, Empirical Interval Sample data set: Arsenic (As) Concentrations from Ground Water (See H & H, Chapter 3, page 71) Available as MOE-workshop.xlsx document. 19 / 118

20 Summary Statistics - Mean Data must be collected using RRRs otherwise summary statistics are nonsense! Mean ($\bar{Y}$) = simple average of values. Use the average() function in Excel. $\bar{Y}_{As}$ = [value missing]. Means are greatly influenced by outliers so these need to be dealt with prior to analysis. 20 / 118

21 Summary Statistics - Standard Deviation Data must be collected using RRRs otherwise summary statistics are nonsense! Standard Deviation: $s = \sqrt{\frac{\sum (Y_i - \bar{Y})^2}{n-1}}$. This is a measure of spread of the INDIVIDUAL values in the sample (and by inference in the population). Use the stdev.s() function in Excel. $s_{As}$ = [value missing]. Standard deviations are greatly influenced by outliers so these need to be dealt with prior to analysis. 21 / 118

22 Summary Statistics - Empirical Rule Data must be collected using RRRs otherwise summary statistics are nonsense! Empirical Rule: $\bar{Y} \pm 2s$ contains about 95% of INDIVIDUAL data values. Use average() +/- 2*stdev.s() in Excel. The empirical interval for As is ([bounds missing]), which contains 24/25 = 96% of observations. This gives a rough range for FUTURE observations. Empirical intervals are greatly influenced by outliers so these need to be dealt with prior to analysis. 22 / 118

23 Summary Statistics - SE & Confidence Interval Data must be collected using RRRs otherwise summary statistics are nonsense! Every time we take a NEW sample, the sample mean ($\bar{Y}$) will change. How reproducible are our results? The standard error (SE) measures the variability of future SAMPLE MEANS. The standard deviation (SD) measures the variability of future OBSERVATIONS. Under Simple Random Sampling the SE of the sample mean is $s/\sqrt{n}$. Different sampling schemes and different statistics have different SE formulas. Excel does NOT have a built-in function: SE = stdev.s(cellrange)/sqrt(count(cellrange)) = [value missing]. SEs are greatly influenced by outliers so these need to be dealt with prior to analysis. 23 / 118

24 Summary Statistics - SE & Confidence Interval Data must be collected using RRRs otherwise summary statistics are nonsense! A 95% confidence interval is a range of plausible values for the population MEAN; intervals built this way cover the true mean in about 95% of repeated samples. A 95% empirical interval is an estimate of the likely values of OBSERVATIONS from future samples. Under Simple Random Sampling, a 95% c.i. is found as $\bar{Y} \pm t_{\alpha/2,\,n-1} \frac{s}{\sqrt{n}}$. 95% c.i. bounds = average(cellrange) -/+ confidence.t(0.05, stdev.s(cellrange), count(cellrange)). For the As data the 95% c.i. is ([bounds missing]). CIs are greatly influenced by outliers so these need to be dealt with prior to analysis. 24 / 118
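The quantities on the last few slides (mean, SD, SE, empirical interval, confidence interval) can be sketched outside Excel as follows. This is a minimal illustration, not the workshop's spreadsheet: the readings are hypothetical (they are not the As data), the function name `summary_stats` is made up for this example, and the t critical value is passed in by hand rather than looked up.

```python
import math
import statistics

def summary_stats(values, t_crit):
    """Mean, SD, SE, empirical interval and CI under simple random sampling."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)        # sample SD, divisor n - 1 (Excel: stdev.s)
    se = sd / math.sqrt(n)               # SE of the sample mean
    ei = (mean - 2 * sd, mean + 2 * sd)  # rough 95% range for INDIVIDUAL values
    ci = (mean - t_crit * se, mean + t_crit * se)  # 95% CI for the population MEAN
    return {"n": n, "mean": mean, "sd": sd, "se": se, "ei": ei, "ci": ci}

# Hypothetical readings (assumption for illustration only).
# For n = 5 the two-sided 95% critical value is t_{0.025, 4} = 2.776.
readings = [2.0, 4.0, 4.0, 5.0, 10.0]
stats = summary_stats(readings, t_crit=2.776)
print(stats)
```

Note that the empirical interval uses the SD while the confidence interval uses the SE, so the CI narrows as n grows but the EI does not.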

25 Summary Statistics - Using Analysis ToolPack 25 / 118

26 Summary Statistics - Using Analysis ToolPack 26 / 118

27 Summary Statistics - Using Analysis ToolPack Usual summary statistics. Notice that only the half-width of the 95% CI is provided, so you still must compute the upper and lower bounds yourself. 27 / 118

28 Summary Statistics - Recap Data must be collected using RRRs otherwise summary statistics are nonsense! The mean measures the average value in the data set. The standard deviation (SD) measures the variability of future OBSERVATIONS. The standard error (SE) measures the variability of future SAMPLE MEANS. A 95% empirical interval (EI) is an estimate of the likely values of OBSERVATIONS from future samples. A 95% confidence interval (CI) is a range of plausible values for the population MEAN; intervals built this way cover the true mean in about 95% of repeated samples. The SD does not depend on n. The SE and the CI width decline as a function of $1/\sqrt{n}$. These statistics are greatly influenced by outliers. 28 / 118

29 Summary Statistics - Mean vs. Geometric Mean Means, SD implicitly assume a symmetric distribution so that averages and spread are meaningful. SE, CI are acceptable regardless of distribution in large samples. [How big is large??]. But they perform better with symmetric distributions in small samples. For skewed data (e.g. long right-tail) a logarithmic transform often improves inference. 29 / 118

30 Summary Statistics - Mean vs. Geometric Mean 30 / 118

31 Summary Statistics - Mean vs. Geometric Mean Two common transformations in water quality are log() and ln(). Natural logarithms - ln() - base e, e.g. ln(100) = 4.60 and $e^{4.60}$ = exp(4.60) = 100. Common logarithms - log() - base 10, e.g. log(100) = 2 and $10^2$ = 100. Relationship: $\ln(y) \approx 2.3 \log(y)$. So it doesn't matter which you use, but be clear in documents what is used. CAUTION: Many computer packages (except Excel) use log() to represent the natural logarithmic transformation, and log10() to represent the common logarithmic transformation. 31 / 118
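Python is one of the packages the CAUTION applies to: its `math.log()` is the natural logarithm and `math.log10()` is the common logarithm. A short sketch of the two transforms and their back-transforms, using the slide's value of 100:

```python
import math

y = 100.0
natural = math.log(y)      # natural log (base e); this is ln() in Excel
common = math.log10(y)     # common log (base 10); this is log() in Excel
ratio = natural / common   # equals ln(10), roughly 2.3

back_natural = math.exp(natural)   # anti-log for ln()
back_common = 10 ** common         # anti-log for log()

print(natural, common, ratio, back_natural, back_common)
```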

32 Summary Statistics - Mean vs. Geometric Mean Effect of transformation: 32 / 118

33 Summary Statistics - Mean vs. Geometric Mean Arithmetic Mean: $\bar{Y} = \frac{\sum Y_i}{n}$. Geometric Mean: $GM_Y = \sqrt[n]{Y_1 Y_2 \cdots Y_n}$. It can also be found by using a logarithmic transformation: $GM_Y = \mathrm{antilog}(\overline{\log(Y)})$. Example: Y = (5, 12, 20). $\bar{Y} = 12.33$. $GM_Y = \sqrt[3]{5 \times 12 \times 20} = 10.63$. $GM_Y = \exp\left(\frac{\ln(5)+\ln(12)+\ln(20)}{3}\right) = \exp\left(\frac{7.090}{3}\right) = \exp(2.363) = 10.63$. $GM_Y \le \bar{Y}$ (this is always true). $GM_Y$ is less sensitive to outliers than $\bar{Y}$ and is often close to the median. 33 / 118
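The Y = (5, 12, 20) example can be checked with a few lines of code. This sketch computes the geometric mean both ways (mean of logs then anti-log, and direct n-th root of the product); the function name `geometric_mean` is chosen for this illustration.

```python
import math

def geometric_mean(values):
    """Geometric mean via the mean of the natural logs, then back-transformed."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))

y = [5, 12, 20]
arithmetic = sum(y) / len(y)          # ordinary average
gm = geometric_mean(y)                # exp(mean of ln(Y))
gm_root = (5 * 12 * 20) ** (1 / 3)    # cube root of the product

print(round(arithmetic, 2), round(gm, 2), round(gm_root, 2))
```

The log route is usually preferred in practice because the product of many readings can overflow or underflow.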

34 Summary Statistics - Mean vs. Geometric Mean You can compute the usual statistics on the log() data... etc. 34 / 118

35 Summary Statistics - Mean vs. Geometric Mean You can compute the usual statistics on the log() data and back transform the mean, EI, and CI bounds. DO NOT BACK TRANSFORM the SD or SE. Be careful when interpreting the back-transformed EI and CI bounds. 35 / 118

36 Summary Statistics - Mean vs. Geometric Mean - Recap Don't be afraid to use the log() transformation, e.g. pH. log() vs. ln(): CAUTION, most packages assume log() is the natural logarithm. Suitable for highly skewed data. CAUTION on back transformation of the mean and CI bounds - these refer to the geometric mean in the population (or very close to the median). 36 / 118

37 Summary Statistics - Quantiles Quantiles (Percentiles) are measures of LOCATION within the dataset. $Q_p$ is the value of Y with at least 100p% of INDIVIDUAL values $\le Q_p$ and at least 100(1 - p)% of INDIVIDUAL values $\ge Q_p$. Examples: $Q_{0.50}$ = 50th percentile = median has at least 50% of INDIVIDUAL values $\le Q_{0.50}$ and at least 50% of INDIVIDUAL values $\ge Q_{0.50}$. $Q_{0.95}$ = 95th percentile has at least 95% of INDIVIDUAL values $\le Q_{0.95}$ and at least 5% of INDIVIDUAL values $\ge Q_{0.95}$. Do not confuse the 95th percentile with the 95% EI or 95% CI. 37 / 118

38 Summary Statistics - Quantiles Finding the p quantiles. [There are at least 5 different ways(!) to compute percentiles.] Sort the data from smallest to largest. Compute (n + 1)p. If (n + 1)p is an integer, use $Y_{[(n+1)p]}$, i.e. the ((n + 1)p)th observation after sorting. If (n + 1)p is fractional, interpolate between the two bracketing observations. In large samples, all of the methods will give similar values. Example: As data. n = 25. Median = 50th percentile. (n + 1)p = 13. Use the 13th observation = 19. The median of (13, 11, 17) is 13. The median of (13, 11, 17, 15) is (13 + 15)/2 = 14. The median of (13, 11, ) is still 13. The median of (13, 11, <5, <5, 9) is 9 where <5 indicates BDL. 38 / 118
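The (n + 1)p rule described above can be sketched as a small function. This is one of several possible percentile definitions; the function name `quantile_n_plus_1` and the choice to return the smallest or largest observation when (n + 1)p falls outside 1..n are assumptions made for this illustration.

```python
def quantile_n_plus_1(values, p):
    """p-th quantile using the (n + 1)p rule with linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p            # 1-based position in the sorted data
    if position <= 1:                 # assumption: clamp to the minimum
        return ordered[0]
    if position >= n:                 # assumption: clamp to the maximum
        return ordered[-1]
    lower = int(position)             # rank of the observation just below
    fraction = position - lower       # how far to move toward the next one
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])

print(quantile_n_plus_1([13, 11, 17], 0.5))
print(quantile_n_plus_1([13, 11, 17, 15], 0.5))
```

The two calls reproduce the slide's medians for (13, 11, 17) and (13, 11, 17, 15).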

39 Summary Statistics - Quantiles Quantiles are (theoretically) invariant to log() transformation, i.e. if you find the quantile on transformed data and then back transform, you get the same value as the quantile on the original data. But, because of interpolation, the values may be slightly different. The Excel function percentile.inc(datarange, p) looks at position p(n - 1) + 1 in the sorted data and interpolates if this is not an integer. 39 / 118

40 Summary Statistics - Quantiles Comparison of Excel vs JMP (Different interpolation rules). 40 / 118

41 Summary Statistics - Quantiles Quantile plots: 1 Sort the data (Y) from smallest to largest. Even if data is censored, it usually can be ordered. 2 Number the sorted values from 1 to n. 3 Create the plotting positions as $p_i = (i - 0.4)/(n + 0.2)$. [Cunnane plotting positions (other formulas exist)]. When tied values are present, each is assigned a separate plotting position and the tied values will create a vertical cliff on the plot. 4 Plot the Y variable on the bottom axis and the plotting positions $p_i$ on the vertical axis, and join the points. 41 / 118
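Steps 1 to 3 above amount to the following sketch. The readings are hypothetical and include a tie so that the "separate plotting position for each tied value" behaviour is visible; the function name `plotting_positions` is chosen for this example.

```python
def plotting_positions(values):
    """Pair each sorted value with p_i = (i - 0.4) / (n + 0.2), i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [(y, (i - 0.4) / (n + 0.2)) for i, y in enumerate(ordered, start=1)]

# Hypothetical readings with a tied pair (assumption for illustration only).
pairs = plotting_positions([7.0, 3.0, 3.0, 12.0, 5.0])
for value, p in pairs:
    print(value, round(p, 3))
```

Plotting the first element of each pair on the bottom axis against the second on the vertical axis gives the quantile plot of step 4.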

42 Summary Statistics - Quantiles Refer to [As] data: 42 / 118

43 Summary Statistics - Quantiles Refer to [As] data: 43 / 118

44 Summary Statistics - Quantiles Refer to log([As]) data: 44 / 118

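45 Summary Statistics - Quantiles If you are willing to assume a specified distribution (e.g. normal) then you can estimate percentiles based on the sample mean and sample standard deviation: $Q_p = \bar{Y} + zs$, where z is the standard normal quantile corresponding to p. [Table of p and z values not preserved.] 45 / 118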

46 Summary Statistics - Quantiles For small samples, the sample standard deviation (s) is a biased estimator of $\sigma$. The adjusted estimate of the percentile is: $Q_{p,adjusted} = \bar{Y} + z \frac{s}{M(df)}$, where M(df) is the mean of a standardized chi-distribution. [See my notes for more details.] 46 / 118
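A possible way to evaluate the adjustment is sketched below. It assumes that M(df) is the usual small-sample bias constant for s, i.e. the mean of a chi distribution with df degrees of freedom divided by $\sqrt{df}$; the author's notes may define it differently, and the function names and input values are hypothetical.

```python
import math

def m_df(df):
    """Assumed form of M(df): mean of a chi distribution with df degrees
    of freedom, divided by sqrt(df), so that E[s] = M(df) * sigma."""
    return math.sqrt(2.0 / df) * math.gamma((df + 1) / 2.0) / math.gamma(df / 2.0)

def adjusted_quantile(mean, s, z, n):
    """Normal-theory quantile with s corrected for small-sample bias."""
    return mean + z * s / m_df(n - 1)

# Hypothetical inputs: mean 10, s = 2, z = 1.645 (95th percentile), n = 25.
unadjusted = 10 + 1.645 * 2
adjusted = adjusted_quantile(10, 2, 1.645, 25)
print(round(m_df(24), 4), round(unadjusted, 3), round(adjusted, 3))
```

Under this assumption the correction is only about 1% at n = 25 but grows quickly as n shrinks.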

47 Summary Statistics - Quantiles Refer to Excel spreadsheet on As SummaryStatistcs 47 / 118

48 Summary Statistics - Tolerance Intervals Avoid NAKED ESTIMATES, e.g. never report a mean without a SE. Tolerance intervals are confidence intervals for percentiles. We are 95% confident that no more than 10% of observations exceed xxxxx. This is a one-sided 95% confidence interval for the 90th percentile. 48 / 118

49 Summary Statistics - TI - Large Samples Let p be the quantile of interest. Compute $R_{upper} = np + z\sqrt{np(1-p)}$. The upper bound of the tolerance interval is then $Y_{[R_{upper}]}$. For a one-sided 95% confidence bound use z = 1.645, as seen earlier. Example: with n = 500, find a 95% one-sided tolerance interval for the 99th percentile, i.e. you will be 95% confident that 1% or less of future observations will exceed this value. The estimated quantile is the 500(0.99) = 495th value. With z = 1.645: $R_{upper} = 500(0.99) + 1.645\sqrt{500(0.99)(1-0.99)} = 495 + 3.66 = 498.66$. The tolerance upper bound is then $Y_{[498.66]}$. Some interpolation may be required to find $Y_{[498.66]}$. 49 / 118

50 Summary Statistics - TI - Large Samples CAUTION: It may require extrapolation outside the range of the dataset (and then cannot be computed). Example: if n = 10 then $R_{upper} = 10(0.99) + 1.645\sqrt{10(0.99)(1-0.99)} = 9.9 + 0.52 = 10.42$, which is beyond the 10th (largest) observation and so out of the range of the dataset (!). [Also, for small samples replace the normal distribution with a binomial distribution as shown in Conover (1999).] 50 / 118
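The large-sample rank formula $R_{upper} = np + z\sqrt{np(1-p)}$ and the out-of-range caution can be sketched together. The function names are made up for this example, and returning `None` when the rank falls outside the data is a design choice for the illustration, not part of the method.

```python
import math

def upper_rank(n, p, z=1.645):
    """Rank of the one-sided upper tolerance bound (normal approximation)."""
    return n * p + z * math.sqrt(n * p * (1 - p))

def order_statistic(sorted_values, rank):
    """Interpolated value at a fractional 1-based rank; None if out of range."""
    n = len(sorted_values)
    if rank < 1 or rank > n:
        return None                      # would need extrapolation
    lower = int(rank)
    fraction = rank - lower
    if fraction == 0:
        return sorted_values[lower - 1]
    return sorted_values[lower - 1] + fraction * (
        sorted_values[lower] - sorted_values[lower - 1])

rank_500 = upper_rank(500, 0.99)   # the n = 500 example
rank_10 = upper_rank(10, 0.99)     # the n = 10 example: rank exceeds n
print(round(rank_500, 2), round(rank_10, 2))
print(order_statistic(list(range(1, 11)), rank_10))
```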

51 Summary Statistics - TI - Large Samples Refer to Excel spreadsheet on As SummaryStatistcs 51 / 118

52 Summary Statistics - TI - Observed Min/Max You will be $(1 - p^n) \times 100\%$ confident that at least the fraction p of the future observations will lie below the maximum of the observed data. For the As dataset, n = 25. You are $(1 - 0.95^{25}) \times 100\%$ = 72% confident that at least 95% of future observations will lie below the largest value, or 72% confident that no more than 5% of observations will exceed the maximum. You are $(1 - 0.99^{25}) \times 100\%$ = 22% confident that at least 99% of future observations will lie below the largest value, or only 22% confident that no more than 1% of observations will exceed the maximum. If n = 5 (not much useful information available): You are $(1 - 0.95^{5}) \times 100\%$ = 23% confident that at least 95% of future observations will lie below the largest value, or only 23% confident that no more than 5% of observations will exceed the maximum. 52 / 118
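The $(1 - p^n)$ rule is easy to evaluate directly, and inverting it shows how many observations would be needed for a given confidence. The function names below are chosen for this sketch; the inversion is a simple rearrangement of the same formula rather than something stated on the slide.

```python
import math

def confidence_below_max(n, p):
    """Confidence that at least a fraction p of future values lie below
    the observed maximum of n independent observations."""
    return 1.0 - p ** n

def required_n(p, confidence):
    """Smallest n for which 1 - p**n reaches the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(p))

for n, p in [(25, 0.95), (25, 0.99), (5, 0.95)]:
    print(n, p, round(100 * confidence_below_max(n, p)))

print(required_n(0.95, 0.95))
```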

53 Summary Statistics - TI - Observed Min/Max Refer back to [As] spreadsheet 53 / 118

54 Summary Statistics - TI - Using normal distribution The previous slides show that very little information is available in small samples if you are not willing to assume a distribution for the data. If the assumption of normality is sensible, you can do better. The basic form is $TI_p = \bar{Y} + ks$, where k is obtained from tables. prc263.htm Gerow, K. and Bielen, Confidence Intervals for Percentiles: An Application to Estimation of Potential Maximum Biomass of Trout in Wyoming Streams. North American Journal of Fisheries Management 19, CIFPAA>2.0.CO;2 and Excel spreadsheet. CAUTION: TI are very sensitive to the normality assumption! 54 / 118

55 Summary Statistics - TI - Using normal distribution Refer to the [As] workbook. 55 / 118

56 Summary Statistics - Quantile and TI - Recap Quantiles measure the LOCATION of INDIVIDUAL observations (e.g. median or 95th percentile). Tolerance Intervals (TI) are confidence intervals on percentiles, e.g. you are 90% sure that no more than 5% of INDIVIDUAL values exceed xxxxx. Quantiles can be estimated: Using non-parametric methods. Find $Q_p = Y_{[np]}$. [The (np) varies among methods.] Assuming a distribution (e.g. normal). $Q_p = \bar{Y} + zs$. Tolerance intervals can be estimated: Using non-parametric methods. Require large samples. Based on min/max of observed data. Frightening how little knowledge is available in small samples. Based on normal distribution. Assumption of normality is crucial for extreme quantiles (!) 56 / 118

57 Summary Statistics - Final Summary Data must be collected using RRRs; otherwise summary statistics have no meaning. Are you interested in the AVERAGE? Estimate the mean, SE, and CI. CIs say NOTHING about INDIVIDUAL observations. Are you interested in the SPREAD? Estimate the mean, SD, and EI. Two-sided tolerance intervals (not covered in this webinar). Are you interested in EXTREMES (e.g. higher order quantiles)? Estimate quantiles and TI. Non-parametric (with large samples), or parametric (with small samples), but the latter is EXTREMELY sensitive to the assumption of normality. 57 / 118

58 Assessing Normality Assessing Normality - Normal Quantile Plots 58 / 118

59 Assessing Normality Objectives: Constructing a normal quantile plot. Assessing normality based on quantile plot. Detecting outliers and skewness based on quantile plot 59 / 118

60 Assessing Normality - Quantile Plots Quantile-plots are a graphical method to assess distributional assumptions: Easier to assess straight-line fit rather than fitting to a curve It is not necessary to create arbitrary bins as is needed for histograms. All of the data are displayed unlike box-plots. Every point in the data can be displayed without overlap. 60 / 118

61 Assessing Normality - Normal Quantile Plot Construction: 1 Sort the data (Y) from smallest to largest. 2 Number the sorted values from 1 to n. No adjustment is made for tied values. 3 Create the plotting positions as $p_i = (i - 0.4)/(n + 0.2)$. 4 Compute the normal quantile ($Q_p$) using the NORMSINV($p_i$) or the NORM.S.INV($p_i$) function in Excel. 5 Plot $Q_p$ (on the X-axis) vs Y (on the Y-axis). If the distribution is correct for the data, the points should lie on an approximate straight line. The slope of the line estimates s (the sample standard deviation); the value of the line at X = 0 estimates the mean. 61 / 118
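The construction steps above can be sketched with the Python standard library, where `NormalDist().inv_cdf` plays the role of Excel's NORM.S.INV. The readings are hypothetical, the function names are invented for this example, and the straight line is fitted by ordinary least squares, which is one reasonable choice rather than the only one.

```python
from statistics import NormalDist, mean

def normal_quantile_points(values):
    """(normal quantile, observed value) pairs for a normal quantile plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.4) / (n + 0.2)), y)
            for i, y in enumerate(ordered, start=1)]

def fit_line(points):
    """Least-squares slope and intercept of y on x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical readings (assumption for illustration only).
data = [4.1, 5.0, 5.6, 6.2, 7.3, 4.8, 5.9]
points = normal_quantile_points(data)
slope, intercept = fit_line(points)
print(round(slope, 3), round(intercept, 3))
```

Because the plotting positions are symmetric about 0.5, the fitted value at X = 0 coincides with the sample mean, while the slope gives a rough estimate of the standard deviation.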

62 Assessing Normality - Quantile Plots Quantile plot of the original data. 62 / 118

63 Assessing Normality - Quantile Plots Quantile plot of the log(As) data. 63 / 118

64 Assessing Normality - Quantile Plots 64 / 118

65 Assessing Normality - Quantile Plots 65 / 118

66 Assessing Normality - Quantile Plots 66 / 118

67 Assessing Normality - Quantile Plots 67 / 118

68 Assessing Normality - Summary Q-Q plots are an easy way to assess normality. Look for departures from linearity (but don't over-interpret the plots). For inference about the MEAN, it is not too crucial that the underlying distribution is normal except in (very) small sample sizes. For inference about extreme percentiles, it is crucial that normality assumptions be satisfied. 68 / 118

69 Linear Models Comparing Water Quality Readings 69 / 118

70 Comparing Water Quality Readings Objectives: How does WQ compare between 2+ sites Types of designs (paired vs. unpaired) Case study - French Creek 70 / 118

71 Comparing Water Quality Readings Types of designs. Paired/blocked: Paired (2 sites) or Blocked (2+ sites) designs are similar. Synoptic (same day) or near synoptic (same week?) readings are taken. Interested in the average difference. Not necessary to have a random sample of times. It may be preferable to select times to enhance contrast (e.g. sample at low and peak flows). Paired t-test; Single Factor Randomized Block ANOVA. Independent samples (not part of this course): 2+ sites to be compared. Separate, random samples taken from each site. Interested in comparing the MEANS across the sites. Independent sample t-test; Single Factor Completely Randomized Design ANOVA (a.k.a. One-way ANOVA). You MUST match the analysis to the design! 71 / 118

72 Comparing Water Quality Readings - Case Study French Creek - data set available. Five locations along French Creek. Monthly + two sets of 5-in-30 samples starting late-July and late-October. (Near) Synoptic data. How do the readings compare across the sites? 72 / 118

73 Comparing Water Quality Readings - 2 Paired Sites Compare readings in 2 sites. Align the data by date. Find the difference in readings or the log(ratio) of readings. Use the difference if the range of values is small so that the differences between sites are relatively similar (e.g. a consistent difference of about 2 units). Use the log(ratio) if the range of values is large so that differences between sites are NOT relatively similar (e.g. range from 2 to 200 units), but the RATIO of readings is (e.g. one site's readings are about twice the other site's). Compute the mean difference, the SE of the mean difference, and the 95% confidence interval for the mean difference, and see if the 95% confidence interval includes 0. If you want a p-value, test the hypothesis that the mean difference in the population is zero. 73 / 118

74 Comparing Water Quality Readings - 2 Paired Sites [Table: Turbidity Readings by Sampling Week for BARCLAY BRG, COOMBS, GRAFTON ROAD, NEW HWY and WINCHESTER; values not preserved.] 74 / 118

75 Comparing Water Quality Readings - 2 Paired Sites Start with plots over time for each site: 75 / 118

76 Comparing Water Quality Readings - 2 Paired Sites Compute the difference or log(ratio) between two sites (e.g. Barclay vs. Coombs). Drop cases where missing values are present. 76 / 118

77 Comparing Water Quality Readings - 2 Paired Sites 77 / 118

78 Comparing Water Quality Readings - 2 Paired Sites Compute using Excel functions: Number of differences/log-ratios - count(). Mean difference/log-ratio - average(). Std dev of diff/log-ratio - stdev.s(). 95% CI half width - confidence.t(). 95% CI = mean ± 95% CI half width. t-test for testing equality of means - t.test(). Also take anti-logs of the mean and 95% CI for the log-ratio. 78 / 118
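The same paired computation can be sketched in code. The readings below are hypothetical (they are not the French Creek data), `None` marks a missing value, the function name `paired_summary` is invented for this example, and the t critical value is supplied by hand for the number of complete pairs.

```python
import math
import statistics

def paired_summary(site_a, site_b, t_crit, use_log_ratio=False):
    """Mean paired difference (or ratio) with a CI; incomplete pairs are dropped."""
    pairs = [(a, b) for a, b in zip(site_a, site_b)
             if a is not None and b is not None]
    if use_log_ratio:
        d = [math.log(a / b) for a, b in pairs]
    else:
        d = [a - b for a, b in pairs]
    n = len(d)
    mean_d = statistics.mean(d)
    se = statistics.stdev(d) / math.sqrt(n)
    lower, upper = mean_d - t_crit * se, mean_d + t_crit * se
    if use_log_ratio:   # anti-log so the result reads as a ratio
        return n, math.exp(mean_d), (math.exp(lower), math.exp(upper))
    return n, mean_d, (lower, upper)

# Hypothetical turbidity readings in NTU (assumption for illustration only).
site_a = [2.0, 4.2, 1.6, 8.0, 3.0, 5.0]
site_b = [1.0, 2.0, 0.8, 4.1, None, 2.5]
# Five complete pairs remain, so df = 4 and t_{0.025, 4} = 2.776.
n, mean_diff, ci_diff = paired_summary(site_a, site_b, t_crit=2.776)
_, ratio, ci_ratio = paired_summary(site_a, site_b, t_crit=2.776, use_log_ratio=True)
print(n, round(mean_diff, 2), ci_diff)
print(round(ratio, 2), ci_ratio)
```

In this made-up data the differences vary widely but the ratios are nearly constant, which is the situation where the log(ratio) analysis gives the tighter and more interpretable interval.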

79 Comparing Water Quality Readings - 2 Paired Sites 79 / 118

80 Comparing Water Quality Readings - 2 Paired Sites Barclay averages 0.947 NTU higher than Coombs, with a 95% c.i. of between 0.45 and 1.44 NTU higher than Coombs. Barclay is, on average, approx 1.98x (95% CI 1.70x to 2.30x) larger than Coombs. CAUTION: 95% c.i. say nothing about individual differences or individual ratios, i.e. it is NOT CORRECT TO SAY that 95% of differences lie between 0.45 and 1.44 NTU. 80 / 118

81 Comparing Water Quality Readings - 2 Paired Sites CAUTIONS: Excel does not deal with missing data very nicely. Look what happens if last reading for Coombs is missing. 81 / 118

82 Comparing Water Quality Readings - 2 Paired Sites Similar output from JMP with additional graphs to show that log(ratio) is likely better choice than difference: 82 / 118

83 Comparing Water Quality Readings - 2 Paired Sites JMP also handles missing values automatically: 83 / 118

84 Comparing Water Quality Readings - 2 Paired Sites A formal p-value is unnecessary but can be obtained as well. Because the p-values are very small, there is strong evidence of a difference (on average) between the two sites. 84 / 118

85 Comparing Water Quality Readings - 2 Paired Sites Summary Pairing is induced by synoptic readings. Delete any pairs with missing values. More advanced software (e.g. R/SAS/JMP) can also incorporate missing data but this is beyond the scope of the course. CAUTION: EXCEL does NOT handle missing data well. Compute differences and/or log(ratio). Use differences if readings are similar over time. Use log(ratio) if there is large variation in readings and the RATIO is consistent over time. Compute the mean difference or log-ratio and the 95% confidence interval for the population mean difference or log-ratio. Is 0 included in the 95% confidence interval? If so, then there is no evidence of a difference (on average). CAUTION: 95% confidence intervals say NOTHING about individual differences or log-ratios. Use the anti-log on the mean and 95% c.i. to convert log-ratios back to ratios. 85 / 118

86 Comparing Water Quality Readings - 3+ Paired Sites It is possible to extend the analysis to 3 or more sites with synoptic readings. Make an array of week by site. Record actual values or ln(values). CAUTION: EXCEL does not ALLOW for ANY missing data. You must exclude the entire sampling week if any data is missing. This can lead to a LARGE loss of data. JMP/SAS/R can gracefully deal with missing data. 86 / 118

87 Comparing Water Quality Readings - 3+ Paired Sites Notice that readings on are dropped FOR all sites because missing data at New Highway 87 / 118

88 Comparing Water Quality Readings - 3+ Paired Sites Known as a Randomized Block Design. Blocks = Synoptic Times = device for pairing up observations that are affected in a similar way. Assume that the differences among sites are relatively consistent across the blocks == NO INTERACTION between blocks and sites. (OR) Assume that the ratio among sites is relatively consistent across the blocks so that the differences of log(values) are relatively consistent (e.g. site A might always be about 2x larger than site B). Again start with a plot of values as seen earlier. 88 / 118

89 Comparing Water Quality Readings - 3+ Paired Sites Start with plots over time for each site: 89 / 118

90 Comparing Water Quality Readings - 3+ Paired Sites 90 / 118

91 Comparing Water Quality Readings - 3+ Paired Sites 91 / 118

92 Comparing Water Quality Readings - 3+ Paired Sites Here, column effect = effects of SITES. P-value = 0.02, so there is some evidence of a consistent difference in the MEANS among sites. This does NOT indicate which sites could have the same or different means. Need to follow up with a Tukey Multiple Comparison Procedure to identify which pairs of sites could have different means. 92 / 118

93 Comparing Water Quality Readings - 3+ Paired Sites Look at estimates of MARGINAL means. The critical range is $CR = Q_a \sqrt{\frac{MSE}{\#\,blocks}}$, where $Q_a$ is the value from the Studentized range with $df_1$ = #sites and $df_2 = df_{MSE}$. In this case we look for (5, 52) df (see previous slide) at duke.edu/courses/spring98/sta110c/qtable.html and find $Q_a = 4.20$. [The slide's worked value of CR, which involves the value 0.65, is not preserved.] Then two means could be different if $|\bar{Y}_1 - \bar{Y}_2| > CR$. 93 / 118
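The critical-range comparison can be sketched as below. Only $Q_a = 4.20$ comes from the slide; the MSE, the number of blocks and the site means are hypothetical values chosen so the arithmetic is easy to follow, and the function names are invented for this example.

```python
import math
from itertools import combinations

def critical_range(q, mse, n_blocks):
    """Tukey critical range CR = Q * sqrt(MSE / number of blocks)."""
    return q * math.sqrt(mse / n_blocks)

def differing_pairs(site_means, cr):
    """Pairs of sites whose marginal means differ by more than CR."""
    return [(a, b) for a, b in combinations(sorted(site_means), 2)
            if abs(site_means[a] - site_means[b]) > cr]

# Hypothetical MSE, block count and marginal means (illustration only).
cr = critical_range(q=4.20, mse=0.5, n_blocks=8)
site_means = {"A": 1.0, "B": 1.5, "C": 2.6}
pairs = differing_pairs(site_means, cr)
print(round(cr, 2), pairs)
```

Every pair of sites is compared against the same CR, which is what makes the hand computation in Excel tedious as the number of sites grows.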

94 Comparing Water Quality Readings - 3+ Paired Sites This is VERY tedious and error prone in EXCEL; use a proper package such as JMP, R, SAS, etc. The output is automatic and more informative, and missing data can be handled. 94 / 118

95 Comparing Water Quality Readings - 3+ Paired Sites P-value is small, so there is some evidence of a difference in means among the sites. It does not indicate where the difference may lie. 95 / 118

96 Comparing Water Quality Readings - 3+ Paired Sites This indicates which Sites could have the same mean. Think of paint-chips to understand overlap in ranges of sites that could be the same. 96 / 118

97 Comparing Water Quality Readings - 3+ Paired Sites Provides estimates of effects and confidence intervals for each pairwise difference. 97 / 118

98 Comparing Water Quality Readings - Summary Typically used for synoptic data to see if sites are comparable. Exactly 2 sites: Find the difference or log(ratio) of readings from both sites. Drop any sampling dates with missing data. Find the mean and 95% confidence interval for the difference in MEANS. See if 0 is included in the confidence interval. Find the p-value using a paired t-test. 98 / 118

99 Comparing Water Quality Readings - Summary 3+ Sites: Analyze either raw data or log(data). Use Randomized Block Design analysis (Excel: Two-factor with no replication). Look in the ANOVA table at whichever of the Rows/Columns effects corresponds to SITES. Program the Tukey Multiple Comparison procedure by hand (groan)! CAUTION: EXCEL does not deal with missing values correctly - GIVES WRONG ANSWERS. CAUTION: EXCEL is very clumsy in finding where the differences lie. You must program the Tukey procedure. You will make mistakes in doing this! CAUTION: EXCEL does not provide other output to check the assumptions of the models. Avoid EXCEL; use a proper package such as JMP, R or SAS! 99 / 118

100 Case Study - Barclay Bridge Turbidity Case Study - French Creek - Barclay Bridge - Turbidity 100 / 118

101 Case Study - Barclay Bridge Turbidity Monthly sampling plus 5-in-30 samples 101 / 118

102 Case Study - Barclay Bridge Turbidity What are the objectives? Find the distribution of turbidity across the year? Estimate the 95th percentile across the year? Estimate the mean and percentiles only in November? Problems: Non-random sampling across the year, with some days having a higher probability of sampling than other days. More samples deliberately taken in August (when the turbidity is low) and more samples deliberately taken in November (when the turbidity is high). High autocorrelation in values taken close together in the 5-in-30 samples. Standard errors will be understated, i.e. you will think you are more precise than you are. Not clear how to interpret means and percentiles for part of a year. The 95th percentile based on the 5-in-30 samples will NOT estimate the 95th percentile for the year. 102 / 118

103 Case Study - Barclay Bridge Turbidity Example of bias in computing percentiles. Simulated data to follow data based on previous curve with some random noise around the curve. Computed 95 th percentile based on yearly data and on sample dates data. 103 / 118

104 Case Study - Barclay Bridge Turbidity Example of bias in computing means. Simulated data to follow data based on previous curve with some random noise around the curve. Computed mean based on yearly data and on sample dates data. 104 / 118

105 Case Study - Barclay Bridge Turbidity Not clear what to do with this type of data without some consideration of filling in some of the missing data. 105 / 118

106 Case Study - Mercantile Readings are taken monthly, except that 5-in-30 day samples are again taken in August and November. There are many censored readings (indicated by the < character in the adjacent column). There are duplicate and split samples for QA/QC work. There is seasonality in some of the characteristics. 106 / 118

107 Case Study - Mercantile Duplicate and split-sample measurements. These are NOT independent observations and the usual way to deal with these is to the the average of the duplicate or split-sample measurements. Outliers. The sole outliers for NO3 and Turbidity are synoptic. What happened? For some variables (e.g. NO3), weak seasonality present so some pooling over all month in a year is a possibility. I am still worried about the 5-in-30 readings as serial correlation may be a problem? Some variables are highly censored while others are lightly censored. There is no easy way to deal with censoring in Excel other than to either ignore it (i.e. treat the limit as the data value) or use 1/2 of the detection limit as the data value. 107 / 118
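
The two data-preparation steps described above (averaging duplicates and substituting for censored values) can be sketched as follows. The records, dates and the detection limit of 0.10 are made up for illustration. Substituting 1/2 of the detection limit is the crude workaround named on the slide, not a recommended censored-data method; Helsel (2005) in the recommended reading covers proper approaches.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (sample date, censoring flag, reported value).
# A "<" flag means the reported value is the detection limit.
records = [
    ("2003-08-05", "", 0.42),
    ("2003-08-05", "", 0.46),   # duplicate of the sample above
    ("2003-08-12", "<", 0.10),  # censored reading
    ("2003-09-02", "", 0.31),
]


def substitute(flag, value, fraction=0.5):
    """Return the value to analyse; censored readings become fraction * detection limit."""
    return value * fraction if flag == "<" else value


# Group readings by date so duplicates and split samples collapse to one value per date.
by_date = defaultdict(list)
for date, flag, value in records:
    by_date[date].append(substitute(flag, value))

daily = {date: mean(values) for date, values in sorted(by_date.items())}

for date, value in daily.items():
    print(date, round(value, 3))
```

Setting `fraction=1.0` corresponds to the other option on the slide, treating the limit itself as the data value.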

108 Case Study - Mercantile Turbidity and NO3 in Mercantile Creek: 108 / 118

109 Case Study - Mercantile Several interesting features. There appears to be seasonality with high readings in October and April, but they are not consistent over time (e.g. look what happened in 2003). Sampling intensity is not uniform over the years, with some months missed and no apparent 5-in-30 sampling occasions. A very large value occurred in November 2002 with very small values in the months on either side of it, indicating a very volatile system. As in the French Creek dataset, it is not clear what the 95th percentile is supposed to be measuring: just the peak events, or the entire year? There appears to be an increasing trend in both the mean and the variability. Fitting a trend line to this data set is beyond the capabilities of Excel because of the problems above. 109 / 118

110 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers in […]. This was done in R (don't use Excel - it gives WRONG results for many regressions!). More details on fitting linear models are available in my course notes. Coefficient table (numeric values lost in transcription): columns Estimate, Std. Error, t value, Pr(>|t|); rows (Intercept), Date. Standard errors may be too small because of the autocorrelation in the residuals. Strong evidence of an increase over time. 110 / 118

111 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers in […]. 111 / 118

112 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers in […]. The residual plot (lower left) shows an increase in variance with the mean. 112 / 118
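
The trend fit and the autocorrelation caveat on the slides above can be illustrated with a minimal Python sketch. The course analysis was done in R, so this is not the author's code, and the series below is simulated (a hypothetical monthly series with an upward trend), not the Mercantile Creek data. It fits a straight line by ordinary least squares, computes the usual standard error of the slope, and then computes the lag-1 autocorrelation of the residuals as a quick check on the independence assumption behind that standard error.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical monthly series: time in years since the first sample, trend plus noise.
t = [month / 12.0 for month in range(60)]
y = [1.0 + 0.3 * ti + random.gauss(0.0, 0.5) for ti in t]

n = len(t)
t_bar = sum(t) / n
y_bar = sum(y) / n
sxx = sum((ti - t_bar) ** 2 for ti in t)
sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))

# Ordinary least squares estimates.
slope = sxy / sxx
intercept = y_bar - slope * t_bar
residuals = [yi - (intercept + slope * ti) for ti, yi in zip(t, y)]

# Usual standard error of the slope; it assumes independent residuals.
sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
se_slope = math.sqrt(sigma2 / sxx)
t_value = slope / se_slope

# Lag-1 autocorrelation of the residuals (values far from 0 suggest dependence).
r1 = sum(a * b for a, b in zip(residuals, residuals[1:])) / sum(r ** 2 for r in residuals)

print(f"slope {slope:.3f}, SE {se_slope:.3f}, t value {t_value:.2f}")
print(f"lag-1 autocorrelation of residuals {r1:.3f}")
```

If the lag-1 autocorrelation is clearly positive, the reported standard error is likely too small, which is the concern raised on the slide.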

113 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers in […]. This was done in R (don't use Excel - it gives WRONG results for many regressions!). More details on fitting linear models are available in my course notes. Coefficient table (numeric values lost in transcription): columns Estimate, Std. Error, t value, Pr(>|t|); rows (Intercept), Date. Standard errors may be too small because of the autocorrelation in the residuals. No evidence of an increase over time. 113 / 118

114 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers in […]. 114 / 118

115 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers in […]. The residual plots now all look OK. 115 / 118

116 Summary RRRs. No easy way to compute means and percentiles for non-randomly selected data (e.g. monthly + 5-in-30 data). More flexibility when comparing across sites or examining trends over time. Some caution is needed if readings are too close together (e.g. hourly or daily); see my webinar for Air Quality. What are you interested in? Averages - estimate means, SE, and CI. Spread - estimate means, SD, and EI. Extremes - estimate percentiles and TI (either parametric or non-parametric). CAUTION: these results are extremely sensitive to violations of the RRRs and distributional assumptions. Assess normality using Normal Quantile plots. Trends - use linear models (see my webinar for Air Quality). No amount of statistical wizardry can rescue poorly collected data! You will be severely constrained in your analyses if you only use Excel! 116 / 118

117 Summary Missing values? If missingness is MCAR, then no problems. MAR and IM are more difficult and very difficult, respectively, to deal with. Outliers? Run the analysis with the outliers in and with the outliers out. If there is no difference, then who cares. 117 / 118
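
The "outliers in, outliers out" check above amounts to running the same summary twice. A minimal sketch, using made-up readings and an outlier flagged by inspection (both are assumptions for illustration):

```python
from statistics import mean, median, stdev

# Hypothetical readings; the value at position 4 is the suspect outlier.
readings = [1.2, 0.9, 1.5, 1.1, 14.0, 1.3, 1.0, 1.4]
suspect_positions = {4}


def summarise(values):
    # The same summary is applied to both versions of the data.
    return {"n": len(values), "mean": mean(values), "median": median(values), "sd": stdev(values)}


with_outliers = summarise(readings)
without_outliers = summarise(
    [value for position, value in enumerate(readings) if position not in suspect_positions]
)

for label, summary in (("outliers in", with_outliers), ("outliers out", without_outliers)):
    print(label, {key: round(value, 2) for key, value in summary.items()})
```

Here the mean and SD change a great deal while the median barely moves, so the conclusion would depend on which summary is of interest.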

118 Summary Other possible analyses not part of this course: Quantile regression, where you model the change in percentiles, e.g. has the 90th percentile changed over time? Testing for changes in the standard deviation over time, rather than the mean, to see if variability has changed over time. Modeling the number of events (e.g. days exceeding WQ guidelines in a month) and whether they change over time. For further help, contact Carl Schwarz (stat.sfu.ca). 118 / 118

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( ) Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

More information

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER seven Statistical Analysis with Excel CHAPTER chapter OVERVIEW 7.1 Introduction 7.2 Understanding Data 7.3 Relationships in Data 7.4 Distributions 7.5 Summary 7.6 Exercises 147 148 CHAPTER 7 Statistical

More information

Chapter 7. One-way ANOVA

Chapter 7. One-way ANOVA Chapter 7 One-way ANOVA One-way ANOVA examines equality of population means for a quantitative outcome and a single categorical explanatory variable with any number of levels. The t-test of Chapter 6 looks

More information

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011

Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Chicago Booth BUSINESS STATISTICS 41000 Final Exam Fall 2011 Name: Section: I pledge my honor that I have not violated the Honor Code Signature: This exam has 34 pages. You have 3 hours to complete this

More information

Technical Guidance for Exploring TMDL Effectiveness Monitoring Data

Technical Guidance for Exploring TMDL Effectiveness Monitoring Data December 2011 Technical Guidance for Exploring TMDL Effectiveness Monitoring Data 1. Introduction Effectiveness monitoring is a critical step in the Total Maximum Daily Load (TMDL) process for addressing

More information

Exercise 1.12 (Pg. 22-23)

Exercise 1.12 (Pg. 22-23) Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.

More information

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96

1. What is the critical value for this 95% confidence interval? CV = z.025 = invnorm(0.025) = 1.96 1 Final Review 2 Review 2.1 CI 1-propZint Scenario 1 A TV manufacturer claims in its warranty brochure that in the past not more than 10 percent of its TV sets needed any repair during the first two years

More information

Part II Chapter 9 Chapter 10 Chapter 11 Chapter 12 Chapter 13 Chapter 14 Chapter 15 Part II

Part II Chapter 9 Chapter 10 Chapter 11 Chapter 12 Chapter 13 Chapter 14 Chapter 15 Part II Part II covers diagnostic evaluations of historical facility data for checking key assumptions implicit in the recommended statistical tests and for making appropriate adjustments to the data (e.g., consideration

More information

Data Analysis Tools. Tools for Summarizing Data

Data Analysis Tools. Tools for Summarizing Data Data Analysis Tools This section of the notes is meant to introduce you to many of the tools that are provided by Excel under the Tools/Data Analysis menu item. If your computer does not have that tool

More information

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar business statistics using Excel Glyn Davis & Branko Pecar OXFORD UNIVERSITY PRESS Detailed contents Introduction to Microsoft Excel 2003 Overview Learning Objectives 1.1 Introduction to Microsoft Excel

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Online Learning Centre Technology Step-by-Step - Excel Microsoft Excel is a spreadsheet software application

More information

Confidence Intervals for the Difference Between Two Means

Confidence Intervals for the Difference Between Two Means Chapter 47 Confidence Intervals for the Difference Between Two Means Introduction This procedure calculates the sample size necessary to achieve a specified distance from the difference in sample means

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

Comparing Means in Two Populations

Comparing Means in Two Populations Comparing Means in Two Populations Overview The previous section discussed hypothesis testing when sampling from a single population (either a single mean or two means from the same population). Now we

More information

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management

KSTAT MINI-MANUAL. Decision Sciences 434 Kellogg Graduate School of Management KSTAT MINI-MANUAL Decision Sciences 434 Kellogg Graduate School of Management Kstat is a set of macros added to Excel and it will enable you to do the statistics required for this course very easily. To

More information

STAT 350 Practice Final Exam Solution (Spring 2015)

STAT 350 Practice Final Exam Solution (Spring 2015) PART 1: Multiple Choice Questions: 1) A study was conducted to compare five different training programs for improving endurance. Forty subjects were randomly divided into five groups of eight subjects

More information

Recall this chart that showed how most of our course would be organized:

Recall this chart that showed how most of our course would be organized: Chapter 4 One-Way ANOVA Recall this chart that showed how most of our course would be organized: Explanatory Variable(s) Response Variable Methods Categorical Categorical Contingency Tables Categorical

More information

How To Test For Significance On A Data Set

How To Test For Significance On A Data Set Non-Parametric Univariate Tests: 1 Sample Sign Test 1 1 SAMPLE SIGN TEST A non-parametric equivalent of the 1 SAMPLE T-TEST. ASSUMPTIONS: Data is non-normally distributed, even after log transforming.

More information

Minitab Tutorials for Design and Analysis of Experiments. Table of Contents

Minitab Tutorials for Design and Analysis of Experiments. Table of Contents Table of Contents Introduction to Minitab...2 Example 1 One-Way ANOVA...3 Determining Sample Size in One-way ANOVA...8 Example 2 Two-factor Factorial Design...9 Example 3: Randomized Complete Block Design...14

More information

1. How different is the t distribution from the normal?

1. How different is the t distribution from the normal? Statistics 101 106 Lecture 7 (20 October 98) c David Pollard Page 1 Read M&M 7.1 and 7.2, ignoring starred parts. Reread M&M 3.2. The effects of estimated variances on normal approximations. t-distributions.

More information

Simple linear regression

Simple linear regression Simple linear regression Introduction Simple linear regression is a statistical method for obtaining a formula to predict values of one variable from another where there is a causal relationship between

More information

DATA INTERPRETATION AND STATISTICS

DATA INTERPRETATION AND STATISTICS PholC60 September 001 DATA INTERPRETATION AND STATISTICS Books A easy and systematic introductory text is Essentials of Medical Statistics by Betty Kirkwood, published by Blackwell at about 14. DESCRIPTIVE

More information

An introduction to using Microsoft Excel for quantitative data analysis

An introduction to using Microsoft Excel for quantitative data analysis Contents An introduction to using Microsoft Excel for quantitative data analysis 1 Introduction... 1 2 Why use Excel?... 2 3 Quantitative data analysis tools in Excel... 3 4 Entering your data... 6 5 Preparing

More information

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com www.excelmasterseries.com

More information

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance

Statistics courses often teach the two-sample t-test, linear regression, and analysis of variance 2 Making Connections: The Two-Sample t-test, Regression, and ANOVA In theory, there s no difference between theory and practice. In practice, there is. Yogi Berra 1 Statistics courses often teach the two-sample

More information

5. Linear Regression

5. Linear Regression 5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4

More information

Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0.

Predictor Coef StDev T P Constant 970667056 616256122 1.58 0.154 X 0.00293 0.06163 0.05 0.963. S = 0.5597 R-Sq = 0.0% R-Sq(adj) = 0. Statistical analysis using Microsoft Excel Microsoft Excel spreadsheets have become somewhat of a standard for data storage, at least for smaller data sets. This, along with the program often being packaged

More information

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression

An analysis appropriate for a quantitative outcome and a single quantitative explanatory. 9.1 The model behind linear regression Chapter 9 Simple Linear Regression An analysis appropriate for a quantitative outcome and a single quantitative explanatory variable. 9.1 The model behind linear regression When we are examining the relationship

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

Simple Linear Regression Inference

Simple Linear Regression Inference Simple Linear Regression Inference 1 Inference requirements The Normality assumption of the stochastic term e is needed for inference even if it is not a OLS requirement. Therefore we have: Interpretation

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

SPSS Manual for Introductory Applied Statistics: A Variable Approach

SPSS Manual for Introductory Applied Statistics: A Variable Approach SPSS Manual for Introductory Applied Statistics: A Variable Approach John Gabrosek Department of Statistics Grand Valley State University Allendale, MI USA August 2013 2 Copyright 2013 John Gabrosek. All

More information

2013 MBA Jump Start Program. Statistics Module Part 3

2013 MBA Jump Start Program. Statistics Module Part 3 2013 MBA Jump Start Program Module 1: Statistics Thomas Gilbert Part 3 Statistics Module Part 3 Hypothesis Testing (Inference) Regressions 2 1 Making an Investment Decision A researcher in your firm just

More information

An analysis method for a quantitative outcome and two categorical explanatory variables.

An analysis method for a quantitative outcome and two categorical explanatory variables. Chapter 11 Two-Way ANOVA An analysis method for a quantitative outcome and two categorical explanatory variables. If an experiment has a quantitative outcome and two categorical explanatory variables that

More information

Skewed Data and Non-parametric Methods

Skewed Data and Non-parametric Methods 0 2 4 6 8 10 12 14 Skewed Data and Non-parametric Methods Comparing two groups: t-test assumes data are: 1. Normally distributed, and 2. both samples have the same SD (i.e. one sample is simply shifted

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

2. Simple Linear Regression

2. Simple Linear Regression Research methods - II 3 2. Simple Linear Regression Simple linear regression is a technique in parametric statistics that is commonly used for analyzing mean response of a variable Y which changes according

More information

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 seema@iasri.res.in 1. Descriptive Statistics Statistics

More information

Chapter 7 Section 7.1: Inference for the Mean of a Population

Chapter 7 Section 7.1: Inference for the Mean of a Population Chapter 7 Section 7.1: Inference for the Mean of a Population Now let s look at a similar situation Take an SRS of size n Normal Population : N(, ). Both and are unknown parameters. Unlike what we used

More information

Exploratory Data Analysis

Exploratory Data Analysis Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

More information

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

Simple Linear Regression

Simple Linear Regression STAT 101 Dr. Kari Lock Morgan Simple Linear Regression SECTIONS 9.3 Confidence and prediction intervals (9.3) Conditions for inference (9.1) Want More Stats??? If you have enjoyed learning how to analyze

More information

Principles of Hypothesis Testing for Public Health

Principles of Hypothesis Testing for Public Health Principles of Hypothesis Testing for Public Health Laura Lee Johnson, Ph.D. Statistician National Center for Complementary and Alternative Medicine johnslau@mail.nih.gov Fall 2011 Answers to Questions

More information

The Variability of P-Values. Summary

The Variability of P-Values. Summary The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 boos@stat.ncsu.edu August 15, 2009 NC State Statistics Departement Tech Report

More information

Using Excel for inferential statistics

Using Excel for inferential statistics FACT SHEET Using Excel for inferential statistics Introduction When you collect data, you expect a certain amount of variation, just caused by chance. A wide variety of statistical tests can be applied

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

2. Filling Data Gaps, Data validation & Descriptive Statistics

2. Filling Data Gaps, Data validation & Descriptive Statistics 2. Filling Data Gaps, Data validation & Descriptive Statistics Dr. Prasad Modak Background Data collected from field may suffer from these problems Data may contain gaps ( = no readings during this period)

More information

A Basic Introduction to Missing Data

A Basic Introduction to Missing Data John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

More information

Simple Regression Theory II 2010 Samuel L. Baker

Simple Regression Theory II 2010 Samuel L. Baker SIMPLE REGRESSION THEORY II 1 Simple Regression Theory II 2010 Samuel L. Baker Assessing how good the regression equation is likely to be Assignment 1A gets into drawing inferences about how close the

More information

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable

More information

BIOL 933 Lab 6 Fall 2015. Data Transformation

BIOL 933 Lab 6 Fall 2015. Data Transformation BIOL 933 Lab 6 Fall 2015 Data Transformation Transformations in R General overview Log transformation Power transformation The pitfalls of interpreting interactions in transformed data Transformations

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1

Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Bill Burton Albert Einstein College of Medicine william.burton@einstein.yu.edu April 28, 2014 EERS: Managing the Tension Between Rigor and Resources 1 Calculate counts, means, and standard deviations Produce

More information

Descriptive Statistics

Descriptive Statistics Descriptive Statistics Suppose following data have been collected (heights of 99 five-year-old boys) 117.9 11.2 112.9 115.9 18. 14.6 17.1 117.9 111.8 16.3 111. 1.4 112.1 19.2 11. 15.4 99.4 11.1 13.3 16.9

More information

" Y. Notation and Equations for Regression Lecture 11/4. Notation:

 Y. Notation and Equations for Regression Lecture 11/4. Notation: Notation: Notation and Equations for Regression Lecture 11/4 m: The number of predictor variables in a regression Xi: One of multiple predictor variables. The subscript i represents any number from 1 through

More information

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries

Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma siersma@sund.ku.dk The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association

More information

Geostatistics Exploratory Analysis

Geostatistics Exploratory Analysis Instituto Superior de Estatística e Gestão de Informação Universidade Nova de Lisboa Master of Science in Geospatial Technologies Geostatistics Exploratory Analysis Carlos Alberto Felgueiras cfelgueiras@isegi.unl.pt

More information

Tutorial 5: Hypothesis Testing

Tutorial 5: Hypothesis Testing Tutorial 5: Hypothesis Testing Rob Nicholls nicholls@mrc-lmb.cam.ac.uk MRC LMB Statistics Course 2014 Contents 1 Introduction................................ 1 2 Testing distributional assumptions....................

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

Common Tools for Displaying and Communicating Data for Process Improvement

Common Tools for Displaying and Communicating Data for Process Improvement Common Tools for Displaying and Communicating Data for Process Improvement Packet includes: Tool Use Page # Box and Whisker Plot Check Sheet Control Chart Histogram Pareto Diagram Run Chart Scatter Plot

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity

More information

Exploratory data analysis (Chapter 2) Fall 2011

Exploratory data analysis (Chapter 2) Fall 2011 Exploratory data analysis (Chapter 2) Fall 2011 Data Examples Example 1: Survey Data 1 Data collected from a Stat 371 class in Fall 2005 2 They answered questions about their: gender, major, year in school,

More information

Getting Correct Results from PROC REG

Getting Correct Results from PROC REG Getting Correct Results from PROC REG Nathaniel Derby, Statis Pro Data Analytics, Seattle, WA ABSTRACT PROC REG, SAS s implementation of linear regression, is often used to fit a line without checking

More information

Introduction to StatsDirect, 11/05/2012 1

Introduction to StatsDirect, 11/05/2012 1 INTRODUCTION TO STATSDIRECT PART 1... 2 INTRODUCTION... 2 Why Use StatsDirect... 2 ACCESSING STATSDIRECT FOR WINDOWS XP... 4 DATA ENTRY... 5 Missing Data... 6 Opening an Excel Workbook... 6 Moving around

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

How To Run Statistical Tests in Excel

How To Run Statistical Tests in Excel How To Run Statistical Tests in Excel Microsoft Excel is your best tool for storing and manipulating data, calculating basic descriptive statistics such as means and standard deviations, and conducting

More information

Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

More information

HYPOTHESIS TESTING WITH SPSS:

HYPOTHESIS TESTING WITH SPSS: HYPOTHESIS TESTING WITH SPSS: A NON-STATISTICIAN S GUIDE & TUTORIAL by Dr. Jim Mirabella SPSS 14.0 screenshots reprinted with permission from SPSS Inc. Published June 2006 Copyright Dr. Jim Mirabella CHAPTER

More information
