A (very) short course on the analysis of Water Quality Data
1 A (very) short course on the analysis of Water Quality Data Carl James Schwarz Department of Statistics and Actuarial Science Simon Fraser University Burnaby, BC, Canada stat.sfu.ca 1 / 118
2 Introduction Introduction 2 / 118
3 Introduction Objectives: Review the concepts of mean, standard deviations, standard errors, confidence intervals and percentiles in terms of water quality studies. Show how compare water quality using synoptic reading at two or more sites. Use some sample datasets provided by MOE to examine how to deal with problematic datasets with suggestions on how to proceed. 3 / 118
4 Introduction - Loading Analysis Tool Pack Excel 2010 ships with some additional statistical (and other) analysis tools in a tool pack, but it may not be loaded. Check the Data tab to see if the Data Analysis menu (Analysis ToolPak) shows. 4 / 118
5 Introduction - Loading Analysis Tool Pack Use the File Options Add-Ins. 5 / 118
6 Introduction - Loading Analysis Tool Pack Select and load the Analysis ToolPak. This is not the best package (very limited functionality). You are better off using a proper package such as JMP/ R/ SAS, etc. 6 / 118
7 Introduction - Recommended Reading Helsel, D. R. and Hirsch, R. M. (online). Statistical Methods in Water Resources. twri4a3/ Included as a pdf file in the package. 7 / 118
8 Introduction - Recommended Reading McBride, G. B. (2005). Using Statistical Methods for Water Quality Management. Wiley, New York 8 / 118
9 Introduction - Recommended Reading Helsel, D. R.(2005). Non-detects and data analysis: Statistics for censored environmental data. Wiley, New York. 9 / 118
10 Introduction - Recommended Reading 10 / 118
11 Introduction - Obtaining Data What is the population of interest? Water quality parameter over the entire year? What is the question of interest? Interested in the average water quality? Interested in the spread of water quality? Interested in higher percentiles (e.g. the 95th percentile)? Interested in comparing averages across sites? Interested in comparing percentiles across sites? (not part of this course) Interested in the relationship of water quality to other covariates (such as rainfall, temperature)? (not part of this course) 11 / 118
12 Introduction - Obtaining Data The 3 Rs of good data collection Randomization - this is what makes the sample representative of the population. Every member of the population should have a known probability of selection; often this is equal. Replication - determines how reproducible your estimates are. This controls the reproducibility of your results. Stratification - controls for explainable sources of variation in your readings. Synoptic readings to compare sites control for time effects. No amount of statistical wizardry can rescue badly collected data! 12 / 118
13 Introduction - Characteristics of Water Quality Data Often has a lower bound of zero and no negative values are possible (usually not a problem). Outliers, typically on the high side (can greatly affect results if not accounted for). Long right tails (skewness). Rather than being symmetric, there is a long tail to the right (try a log-transform). Non-normally distributed data (try a log-transform or resistant methods). Censoring (e.g. below detection limits (BDL)). (not part of this course) Seasonal patterns (e.g. values tend to be higher in the summer than winter). Try stratification or modeling. Autocorrelation (e.g. measurements taken close together in space and time tend to be related). (not part of this course, but see the webinar for the Air Quality branch) 13 / 118
14 Introduction - Dealing with outliers No formal definition of an outlier other than a data value that is unusual. Dealing with outliers: Throw away old-fashioned rules such as Grubbs' rule. Check the data to see if there is a typing error, etc. Perhaps the population is more complex than previously thought. Try the analysis with the outlier in and the outlier out. If it makes no difference, who cares. 14 / 118
15 Introduction - Missing Data Missing data mechanisms: MCAR (Missing completely at random). The missingness is unrelated to the response or any other covariate in the study. For example, the sample vial broke when accidentally dropped. No bias is introduced into the results but the standard error is larger than with complete data. MAR (Missing at random). The missingness is unrelated to the response, but may be related to another covariate. For example, rather than having an equal number of samples in each month, fewer samples can be obtained in January because of access issues. If you know the adjustment factor, the collected data can be reweighted (not part of this course). IM (Informative missing). The missingness is related to the response value. For example, the concentration is so large that it exceeds the detection limits of the machine. Censoring is relatively easy to deal with using modern software (not Excel). Other forms of informative missingness are very difficult to deal with. (not part of this course). 15 / 118
16 Introduction - Transformations You say toe-mah-toe and I say toe-may-toe. Don't be afraid of transformations, e.g. pH is on a log scale. The most common transformation in water quality is the log() transform - discussed later. Sometimes it is not obvious what the correct scale is (e.g. mpg or L/100 km?) See Hoaglin, David C. (1988). Transformations in everyday experience. Chance 1. 16 / 118
17 Summary Statistics and Inference Basic Summary Statistics 17 / 118
18 Summary Statistics Objectives: Mean, Standard Deviation, Empirical Intervals Mean, Standard Errors, Confidence Intervals Mean vs. Geometric Mean Median vs. Mean Quantiles and Tolerance Intervals 18 / 118
19 Summary Statistics - Mean, Std Dev, Empirical Interval Sample data set: Arsenic (As) Concentrations from Ground Water (See H & H, Chapter 3, page 71) Available as MOE-workshop.xlsx document. 19 / 118
20 Summary Statistics - Mean Data must be collected using RRRs otherwise summary statistics are nonsense! Mean (Ȳ) = simple average of the values. Use the average() function in Excel. Ȳ As = Means are greatly influenced by outliers so these need to be dealt with prior to analysis. 20 / 118
21 Summary Statistics - Standard Deviation Data must be collected using RRRs otherwise summary statistics are nonsense! Standard Deviation s = √( Σ(Y_i − Ȳ)² / (n − 1) ). This is a measure of the spread of the INDIVIDUAL values in the sample (and by inference in the population). Use the stdev.s() function in Excel. s As = Standard deviations are greatly influenced by outliers so these need to be dealt with prior to analysis. 21 / 118
22 Summary Statistics - Empirical Rule Data must be collected using RRRs otherwise summary statistics are nonsense! Empirical Rule: Ȳ ± 2s contains about 95% of INDIVIDUAL data values. Use average() ± 2*stdev.s() in Excel. The Empirical Rule interval for As is ( ), which contains 24/25 = 96% of the observations. This gives a rough range for FUTURE observations. Empirical intervals are greatly influenced by outliers so these need to be dealt with prior to analysis. 22 / 118
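The mean/SD/empirical-rule computation above can be sketched in Python (standing in for the Excel formulas); the readings below are made-up illustrative values, not the workshop's As data:

```python
import statistics

# Hypothetical readings standing in for the As dataset (illustrative only)
y = [1.3, 1.5, 1.8, 2.6, 2.8, 3.5, 4.0, 4.8, 5.6, 6.0,
     7.0, 7.6, 8.2, 9.7, 10.3, 11.6, 12.1, 13.7, 14.7, 16.0]

mean = statistics.mean(y)        # Excel: average()
sd = statistics.stdev(y)         # Excel: stdev.s() (sample SD, divisor n-1)

# Empirical rule: mean +/- 2*SD should contain roughly 95% of INDIVIDUAL values
lower, upper = mean - 2 * sd, mean + 2 * sd
coverage = sum(lower <= v <= upper for v in y) / len(y)
```

With roughly symmetric data like this, `coverage` comes out near the advertised 95%.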
23 Summary Statistics - SE & Confidence Interval Data must be collected using RRRs otherwise summary statistics are nonsense! Every time we take a NEW sample, the sample mean (Ȳ) will change. How reproducible are our results? The standard error (SE) measures the variability of future SAMPLE MEANS. The standard deviation (SD) measures the variability of future OBSERVATIONS. Under Simple Random Sampling the SE of the sample mean is s/√n. Different sampling schemes and different statistics have different SE formulas. Excel does NOT have a built-in function: SE = stdev.s(cellrange)/sqrt(count(cellrange)). SEs are greatly influenced by outliers so these need to be dealt with prior to analysis. 23 / 118
24 Summary Statistics - SE & Confidence Interval Data must be collected using RRRs otherwise summary statistics are nonsense! A 95% confidence interval is an estimate of the likely values of MEANS from 95% of future samples. A 95% empirical interval is an estimate of the likely values of OBSERVATIONS from future samples. Under Simple Random Sampling, a 95% c.i. is found as Ȳ ± t(α/2, n−1) × s/√n. In Excel: 95% c.i. bounds = average(cellrange) −/+ confidence.t(.05, stdev.s(cellrange), count(cellrange)). For the As data the 95% c.i. is ( ). CIs are greatly influenced by outliers so these need to be dealt with prior to analysis. 24 / 118
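The SE and t-based confidence interval can be sketched as follows; the data are illustrative and the t quantile for df = 9 is taken from a standard t table rather than computed:

```python
import statistics

# Illustrative sample (not the workshop's As values)
y = [3.2, 4.1, 5.0, 5.5, 6.3, 7.8, 8.4, 9.9, 11.2, 13.0]
n = len(y)

mean = statistics.mean(y)
se = statistics.stdev(y) / n ** 0.5   # SE = s / sqrt(n); no Excel built-in
t_crit = 2.262                        # t(0.025, df = n-1 = 9) from a t table
ci = (mean - t_crit * se, mean + t_crit * se)
```

The half-width `t_crit * se` is what Excel's confidence.t() returns; you add and subtract it from the mean yourself.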
25 Summary Statistics - Using Analysis ToolPack 25 / 118
26 Summary Statistics - Using Analysis ToolPack 26 / 118
27 Summary Statistics - Using Analysis ToolPack Usual summary statistics. Notice that only the half-width of the 95% CI is provided; you must still compute the upper and lower bounds yourself. 27 / 118
28 Summary Statistics - Recap Data must be collected using RRRs otherwise summary statistics are nonsense! The mean measures the average value in the data set. The standard deviation (SD) measures the variability of future OBSERVATIONS. The standard error (SE) measures the variability of future SAMPLE MEANS. A 95% empirical interval (EI) is an estimate of the likely values of OBSERVATIONS from future samples. A 95% confidence interval (CI) is an estimate of the likely values of MEANS from 95% of future samples. The SD does not depend on n. The SE and CI decline as a function of 1/√n. These statistics are greatly influenced by outliers. 28 / 118
29 Summary Statistics - Mean vs. Geometric Mean Means, SD implicitly assume a symmetric distribution so that averages and spread are meaningful. SE, CI are acceptable regardless of distribution in large samples. [How big is large??]. But they perform better with symmetric distributions in small samples. For skewed data (e.g. long right-tail) a logarithmic transform often improves inference. 29 / 118
30 Summary Statistics - Mean vs. Geometric Mean 30 / 118
31 Summary Statistics - Mean vs. Geometric Mean Two common transformations in water quality are log() vs. ln() Natural logarithm - ln() - base e, e.g. ln(100) = 4.60 and e^4.60 = exp(4.60) = 100. Common logarithm - log() - base 10, e.g. log(100) = 2 and 10^2 = 100. Relationship: ln(Y) ≈ 2.3 log(Y). So it doesn't matter which you use, but be clear in documents about which was used. CAUTION: Many computer packages (except Excel) use log() to represent the natural logarithmic transformation, and log10() to represent the common logarithmic transformation. 31 / 118
32 Summary Statistics - Mean vs. Geometric Mean Effect of transformation: 32 / 118
33 Summary Statistics - Mean vs. Geometric Mean Arithmetic Mean - Ȳ = ΣY_i / n. Geometric Mean - GM_Y = (Y_1 × Y_2 × ... × Y_n)^(1/n). Can also be found using a logarithmic transformation: Geometric Mean - GM_Y = antilog(mean(log(Y))). Example: Y = (5, 12, 20). Ȳ = 12.33. GM_Y = (5 × 12 × 20)^(1/3) = 10.63. GM_Y = exp( (ln(5)+ln(12)+ln(20)) / 3 ) = exp(2.363) = 10.63. GM_Y ≤ Ȳ (this is always true). GM_Y is less sensitive to outliers than Ȳ and is often close to the median. 33 / 118
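The slide's worked example, computed both ways (directly and as the antilog of the mean of the logs):

```python
import math

y = [5, 12, 20]                       # the slide's example
arith_mean = sum(y) / len(y)          # 12.33

# Geometric mean as the antilog of the mean of the natural logs
gm = math.exp(sum(math.log(v) for v in y) / len(y))
# gm equals (5 * 12 * 20) ** (1/3) and is never larger than the arithmetic mean
```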
34 Summary Statistics - Mean vs. Geometric Mean You can compute the usual statistics on the log() data... etc. 34 / 118
35 Summary Statistics - Mean vs. Geometric Mean You can compute the usual statistics on the log() data and back-transform the mean, EI, and CI bounds. DO NOT BACK TRANSFORM THE SD or SE. Be careful interpreting the back-transformed EI and CI bounds. 35 / 118
36 Summary Statistics - Mean vs. Geometric Mean - Recap Don't be afraid to use the log() transformation, e.g. for pH. log() vs. ln(). CAUTION: many packages assume log() is the natural logarithm. Suitable for highly skewed data. CAUTION on back-transformation of the mean and CI bounds - these refer to the geometric mean in the population (often very close to the median). 36 / 118
37 Summary Statistics - Quantiles Quantiles (Percentiles) are measures of LOCATION within the dataset. Q_p is the value of Y with at least 100p% of INDIVIDUAL values ≤ Q_p and at least 100(1 − p)% of INDIVIDUAL values ≥ Q_p. Examples: Q.50 = 50th percentile = median: at least 50% of INDIVIDUAL values ≤ Q.50 and at least 50% of INDIVIDUAL values ≥ Q.50. Q.95 = 95th percentile: at least 95% of INDIVIDUAL values ≤ Q.95 and at least 5% of INDIVIDUAL values ≥ Q.95. Do not confuse the 95th percentile with a 95% EI or 95% CI. 37 / 118
38 Summary Statistics - Quantiles Finding the p-th quantile. [There are at least 5 different ways(!) to compute percentiles.] Sort the data from smallest to largest. Compute (n + 1)p. If (n + 1)p is an integer, use Y_[(n+1)p], i.e. the ((n + 1)p)-th observation after sorting. If (n + 1)p is fractional, interpolate between the two bracketing observations. In large samples, all of the methods will give similar values. Example: As data, n = 25. Median = 50th percentile. (n + 1)p = 13. Use the 13th observation = 19. The median of (13, 11, 17) is 13. The median of (13, 11, 17, 15) is (13 + 15)/2 = 14. The median of (13, 11, ) is still 13. The median of (13, 11, < 5, < 5, 9) is 9, where < 5 indicates BDL. 38 / 118
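The (n+1)p rule above can be sketched as a small function; note this is just one of the several quantile conventions the slide mentions:

```python
def quantile(data, p):
    """(n+1)p rule: take the (n+1)p-th order statistic, interpolating
    between the two bracketing values when (n+1)p is fractional."""
    ys = sorted(data)
    n = len(ys)
    h = (n + 1) * p
    if h <= 1:                  # below the smallest order statistic
        return ys[0]
    if h >= n:                  # above the largest order statistic
        return ys[-1]
    k = int(h)                  # 1-based index of the lower bracketing value
    return ys[k - 1] + (h - k) * (ys[k] - ys[k - 1])
```

This reproduces the slide's medians: 13 for (13, 11, 17) and 14 for (13, 11, 17, 15).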
39 Summary Statistics - Quantiles Quantiles are (theoretically) invariant to the log() transformation, i.e. if you find the quantile on transformed data and then back-transform, you get the same value as the quantile on the original data. But, because of interpolation, the values may differ slightly. The Excel function percentile.inc(datarange, p) uses the position 1 + p(n − 1) and interpolates if this is not an integer. 39 / 118
40 Summary Statistics - Quantiles Comparison of Excel vs JMP (Different interpolation rules). 40 / 118
41 Summary Statistics - Quantiles Quantile plots: 1 Sort the data (Y) from smallest to largest. Even if data is censored, it usually can be ordered. 2 Number the sorted values from 1 to n. 3 Create the plotting positions as p_i = (i − 0.4)/(n + 0.2). [Cunnane plotting positions (other formulas exist)]. When tied values are present, each is assigned a separate plotting position and the tied values will create a vertical cliff on the plot. 4 Plot the Y variable on the bottom axis and the plotting positions p_i on the vertical axis, and join the points. 41 / 118
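Step 3's Cunnane plotting positions are a one-liner:

```python
def cunnane_positions(n):
    """Cunnane plotting positions p_i = (i - 0.4) / (n + 0.2), i = 1..n."""
    return [(i - 0.4) / (n + 0.2) for i in range(1, n + 1)]
```

The positions are strictly inside (0, 1) and symmetric about 0.5, so the smallest and largest positions always sum to 1.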
42 Summary Statistics - Quantiles Refer to [As] data: 42 / 118
43 Summary Statistics - Quantiles Refer to [As] data: 43 / 118
44 Summary Statistics - Quantiles Refer to log([as]) data: 44 / 118
45 Summary Statistics - Quantiles If you are willing to assume a specified distribution (e.g. normal) then you can estimate percentiles based on the sample mean and sample standard deviation: Q_p = Ȳ + z s, where z is the standard normal quantile corresponding to p. 45 / 118
46 Summary Statistics - Quantiles For small samples, the sample standard deviation (s) is a biased estimator of σ. The adjusted estimate of the percentile is: Q_p,adjusted = Ȳ + z × s/M(df), where M(df) is the mean of a standardized chi-distribution. [See my notes for more details.] 46 / 118
47 Summary Statistics - Quantiles Refer to Excel spreadsheet on As SummaryStatistcs 47 / 118
48 Summary Statistics - Tolerance Intervals Avoid NAKED ESTIMATES, e.g. never report a mean without an SE. Tolerance intervals are confidence intervals for percentiles. We are 95% confident that no more than 10% of observations exceed xxxxx. This is a one-sided 95% confidence interval for the 90th percentile. 48 / 118
49 Summary Statistics - TI - Large Samples Let p be the quantile of interest. Compute R_upper = np + z√(np(1 − p)). The upper bound of the tolerance interval is then Y_[R_upper]. For a one-sided 95% confidence interval use z = 1.645. Example: with n = 500, find a 95% one-sided tolerance interval for the 99th percentile, i.e. you will be 95% confident that 1% or less of future observations will exceed this value. The estimated quantile is the 500(.99) = 495th value. With z = 1.645: R_upper = 500(.99) + 1.645√(500(.99)(1 − .99)) = 495 + 3.66 = 498.66. The tolerance upper bound is then Y_[498.66]. Some interpolation may be required to find Y_[498.66]. 49 / 118
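The rank calculation R = np + z√(np(1−p)) can be sketched as:

```python
import math

def ti_upper_rank(n, p, z=1.645):
    """Rank of the one-sided upper tolerance bound for the p-th quantile:
    R_upper = n*p + z*sqrt(n*p*(1-p)), with z = 1.645 for 95% confidence."""
    return n * p + z * math.sqrt(n * p * (1 - p))
```

For n = 500 and p = 0.99 this gives a rank of about 498.66; for n = 10 the rank exceeds 10, illustrating the slide's warning that the bound can fall outside the dataset.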
50 Summary Statistics - TI - Large Samples CAUTION: It may require extrapolation outside the range of the dataset (and then cannot be computed). Example: if n = 10 then R_upper = 10(.99) + 1.645√(10(.99)(1 − .99)) = 9.9 + 0.52 = 10.42, which is beyond the largest (10th) observation and so out of the range of the dataset (!). [Also, for small samples replace the normal distribution with a binomial distribution as shown in Conover (1999).] 50 / 118
51 Summary Statistics - TI - Large Samples Refer to Excel spreadsheet on As SummaryStatistcs 51 / 118
52 Summary Statistics - TI - Observed Min/Max You will be (1 − p^n) × 100% confident that at least the fraction p of the future observations will lie below the maximum of the observed data. For the As dataset, n = 25: You are (1 − .95^25) × 100% = 72% confident that at least 95% of future observations will lie below the largest value, or 72% confident that no more than 5% of observations will exceed the maximum. You are (1 − .99^25) × 100% = 22% confident that at least 99% of future observations will lie below the largest value, or only 22% confident that no more than 1% of observations will exceed the maximum. If n = 5 (not much useful information available): You are (1 − .95^5) × 100% = 23% confident that at least 95% of future observations will lie below the largest value, or only 23% confident that no more than 5% of observations will exceed the maximum. 52 / 118
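The (1 − p^n) formula reproduces all three percentages on the slide:

```python
def conf_below_max(n, p):
    """Confidence (%) that at least fraction p of future observations
    lie below the observed maximum: (1 - p**n) * 100."""
    return (1 - p ** n) * 100
```

With n = 25 this gives 72% for p = 0.95 and 22% for p = 0.99; with n = 5 it gives 23% for p = 0.95.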
53 Summary Statistics - TI - Observed Min/Max Refer back to [As] spreadsheet 53 / 118
54 Summary Statistics - TI - Using normal distribution The previous slides show that very little information is available in small samples if you are not willing to assume a distribution for the data. If the assumption of normality is sensible, you can do better. The basic form is: TI_p = Ȳ + k s, where k is obtained from tables. prc263.htm Gerow, K. and Bielen. Confidence Intervals for Percentiles: An Application to Estimation of Potential Maximum Biomass of Trout in Wyoming Streams. North American Journal of Fisheries Management 19. CIFPAA>2.0.CO;2 and Excel spreadsheet. CAUTION: TIs are very sensitive to the normality assumption! 54 / 118
55 Summary Statistics - TI - Using normal distribution Refer to the [As] workbook. 55 / 118
56 Summary Statistics - Quantile and TI - Recap Quantiles measure the LOCATION of INDIVIDUAL observations (e.g. the median or 95th percentile). Tolerance Intervals (TI) are confidence intervals on percentiles, e.g. you are 90% sure that no more than 5% of INDIVIDUAL values exceed xxxxx. Quantiles can be estimated: Using non-parametric methods: find Q_p = Y_[np]. [The exact position varies among methods.] Assuming a distribution (e.g. normal): Q_p = Ȳ + z s. Tolerance intervals can be estimated: Using non-parametric methods (requires large samples). Based on the min/max of the observed data. Frightening how little knowledge is available in small samples. Based on the normal distribution. The assumption of normality is crucial for extreme quantiles (!) 56 / 118
57 Summary Statistics - Final Summary Data must be collected using RRRs; otherwise summary statistics have no meaning. Are you interested in the AVERAGE? Estimate the mean, SE, and CI. CIs say NOTHING about INDIVIDUAL observations. Are you interested in the SPREAD? Estimate the mean, SD, and EI, or two-sided tolerance intervals (not covered in this webinar). Are you interested in EXTREMES (e.g. higher-order quantiles)? Estimate quantiles and TIs: non-parametric (with large samples), or parametric (with small samples), but the latter is EXTREMELY sensitive to the assumption of normality. 57 / 118
58 Assessing Normality Assessing Normality - Normal Quantile Plots 58 / 118
59 Assessing Normality Objectives: Constructing a normal quantile plot. Assessing normality based on quantile plot. Detecting outliers and skewness based on quantile plot 59 / 118
60 Assessing Normality - Quantile Plots Quantile-plots are a graphical method to assess distributional assumptions: Easier to assess straight-line fit rather than fitting to a curve It is not necessary to create arbitrary bins as is needed for histograms. All of the data are displayed unlike box-plots. Every point in the data can be displayed without overlap. 60 / 118
61 Assessing Normality - Normal Quantile Plot Construction: 1 Sort the data (Y) from smallest to largest. 2 Number the sorted values from 1 to n. No adjustment is made for tied values. 3 Create the plotting positions as p_i = (i − 0.4)/(n + 0.2). 4 Compute the normal quantile (Q_p) using the NORM.S.INV(p_i) function in Excel (NORMSINV(p_i) in older versions). 5 Plot Q_p (on the X-axis) vs Y on the Y-axis. If the distribution is correct for the data, the points should lie on an approximate straight line. The slope of the line estimates s (the sample standard deviation); the value of the line at X = 0 estimates the mean. 61 / 118
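Steps 1-4 of the construction can be sketched in Python (the readings below are illustrative; `NormalDist().inv_cdf` plays the role of Excel's NORM.S.INV):

```python
from statistics import NormalDist

# Illustrative readings, sorted smallest to largest (step 1)
y = sorted([1.3, 2.6, 2.8, 4.0, 5.6, 7.0, 8.2, 10.3, 12.1, 16.0])
n = len(y)

# Step 3: Cunnane plotting positions; step 4: standard normal quantiles
pos = [(i - 0.4) / (n + 0.2) for i in range(1, n + 1)]
q = [NormalDist().inv_cdf(p) for p in pos]   # Excel: NORM.S.INV(p_i)
# Step 5: plot q on the X-axis vs y on the Y-axis; an approximate
# straight line suggests the data are roughly normal
```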
62 Assessing Normality - Quantile Plots Quantile plot of the original data. 62 / 118
63 Assessing Normality - Quantile Plots Quantile plot of the log(as) data. 63 / 118
64 Assessing Normality - Quantile Plots 64 / 118
65 Assessing Normality - Quantile Plots 65 / 118
66 Assessing Normality - Quantile Plots 66 / 118
67 Assessing Normality - Quantile Plots 67 / 118
68 Assessing Normality - Summary Q-Q plots are an easy way to assess normality. Look for departures from linearity (but don't over-interpret the plots). For inference about the MEAN, it is not too crucial that the underlying distribution is normal except at (very) small sample sizes. For inference about extreme percentiles, it is crucial that the normality assumption be satisfied. 68 / 118
69 Linear Models Comparing Water Quality Readings 69 / 118
70 Comparing Water Quality Readings Objectives: How does WQ compare between 2+ sites Types of designs (paired vs. unpaired) Case study - French Creek 70 / 118
71 Comparing Water Quality Readings Types of designs Paired/blocked Paired (2 sites) or Blocked (2+ sites) designs are similar Synoptic (same day) or near-synoptic (same week?) readings are taken Interested in the average difference. Not necessary to have a random sample of times. It may be preferable to select times to enhance contrast (e.g. sample at low and peak flows). Paired t-test; Single Factor Randomized Block ANOVA Independent samples (not part of this course). 2+ sites to be compared. Separate, random samples taken from each site. Interested in comparing the MEANS across the sites. Independent-sample t-test; Single Factor Completely Randomized Design ANOVA (a.k.a. One-way ANOVA) You MUST match the analysis to the design! 71 / 118
72 Comparing Water Quality Readings - Case Study French Creek - data set available Five locations along French Creek Monthly + two sets of 5-in-30 samples starting late-july and late-october (Near) Synoptic data How do the readings compare across the sites? 72 / 118
73 Comparing Water Quality Readings - 2 Paired Sites Compare readings at 2 sites. Align the data by date Find the difference in readings or log(ratio) of readings Use the difference if the range of values is small so that the differences between sites are relatively similar (e.g. a consistent difference of about 2 units). Use the log(ratio) if the range of values is large so that differences between sites are NOT relatively similar (e.g. range from 2 to 200 units), but the RATIO of readings is (e.g. one site's readings are about twice the other site's). Compute the mean difference, the SE of the mean difference, and a 95% confidence interval for the mean difference, and see if the 95% confidence interval includes 0. If you want a p-value, test the hypothesis that the mean difference in the population is zero. 73 / 118
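The paired-difference recipe above can be sketched as follows; the paired turbidity readings and the t quantile for df = 5 are illustrative, not the French Creek data:

```python
import statistics

# Hypothetical synoptic turbidity readings (NTU) at two sites, aligned by date
site_a = [2.1, 3.4, 1.8, 5.2, 2.9, 4.1]
site_b = [1.2, 2.0, 0.9, 3.1, 1.5, 2.6]

diffs = [a - b for a, b in zip(site_a, site_b)]   # drop pairs with missing data first
n = len(diffs)
mean_d = statistics.mean(diffs)
se_d = statistics.stdev(diffs) / n ** 0.5
t_crit = 2.571                     # t(0.025, df = 5) from a t table
ci = (mean_d - t_crit * se_d, mean_d + t_crit * se_d)
# If 0 lies outside the CI, there is evidence of an average difference
```

For the log(ratio) analysis, replace `a - b` with `math.log(a / b)` and back-transform the mean and CI bounds with `math.exp`.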
74 Comparing Water Quality Readings - 2 Paired Sites [Table: turbidity readings by sampling week for the sites Barclay Brg, Coombs, Grafton Road, New Hwy, and Winchester.] 74 / 118
75 Comparing Water Quality Readings - 2 Paired Sites Start with plots over time for each site: 75 / 118
76 Comparing Water Quality Readings - 2 Paired Sites Compute the difference or log(ratio) between two sites (e.g. Barclay vs. Coombs). Drop cases where missing values are present. 76 / 118
77 Comparing Water Quality Readings - 2 Paired Sites 77 / 118
78 Comparing Water Quality Readings - 2 Paired Sites Compute using Excel functions Number of differences/log-ratios - count() Mean difference/log-ratio - average() Std dev of diff/log-ratio - stdev.s() 95% CI half-width - confidence.t() 95% CI = mean ± 95% CI half-width t-test for testing equality of means - t.test() Also take anti-logs of the mean and 95% CI for the log-ratio 78 / 118
79 Comparing Water Quality Readings - 2 Paired Sites 79 / 118
80 Comparing Water Quality Readings - 2 Paired Sites Barclay averages 0.947 NTU higher than Coombs, with a 95% c.i. of 0.45 to 1.44 NTU higher than Coombs. Barclay is, on average, approx 1.98x (95% CI 1.70x to 2.30x) larger than Coombs. CAUTION: the 95% c.i. says nothing about individual differences or individual ratios, i.e. it is NOT CORRECT TO SAY that 95% of differences lie between 0.45 and 1.44 NTU. 80 / 118
81 Comparing Water Quality Readings - 2 Paired Sites CAUTIONS: Excel does not deal with missing data very nicely. Look what happens if the last reading for Coombs is missing. 81 / 118
82 Comparing Water Quality Readings - 2 Paired Sites Similar output from JMP with additional graphs to show that log(ratio) is likely better choice than difference: 82 / 118
83 Comparing Water Quality Readings - 2 Paired Sites JMP also handles missing values automatically: 83 / 118
84 Comparing Water Quality Readings - 2 Paired Sites A formal p-value is unnecessary but can be obtained as well. Because the p-values are very small, there is strong evidence of a difference (on average) between the two sites. 84 / 118
85 Comparing Water Quality Readings - 2 Paired Sites Summary Pairing is induced by synoptic readings Delete any pairs with missing values. More advanced software (e.g. R/SAS/JMP) can also incorporate missing data but this is beyond the scope of this course. CAUTION: EXCEL does NOT handle missing data well. Compute differences and/or log(ratio) Use differences if readings are similar over time Use log(ratio) if there is large variation in readings and the RATIO is consistent over time Compute the mean difference or log-ratio and a 95% confidence interval for the population mean difference or log-ratio. Is 0 included in the 95% confidence interval? If so, then there is no evidence of a difference (on average). CAUTION: 95% confidence intervals say NOTHING about individual differences or log-ratios Use the anti-log on the mean and 95% c.i. to convert log-ratios back to ratios. 85 / 118
86 Comparing Water Quality Readings - 3+ Paired Sites It is possible to extend the analysis to 3+ sites with synoptic readings. Make an array of week by site Record actual values or ln(values) CAUTION: EXCEL does not ALLOW ANY missing data. You must exclude the entire sampling week if any data is missing. This can lead to a LARGE loss of data. JMP/SAS/R can gracefully deal with missing data. 86 / 118
87 Comparing Water Quality Readings - 3+ Paired Sites Notice that the readings on that date are dropped for ALL sites because of missing data at New Highway 87 / 118
88 Comparing Water Quality Readings - 3+ Paired Sites Known as a Randomized Block Design Blocks = Synoptic Times = a device for pairing up observations that are affected in a similar way. Assume that the differences among sites are relatively consistent across the blocks == NO INTERACTION between blocks and sites. (OR) Assume that the ratios among sites are relatively consistent across the blocks so that differences of log(values) are relatively consistent (e.g. site A might always be about 2x larger than site B). Again start with a plot of values as seen earlier 88 / 118
89 Comparing Water Quality Readings - 3+ Paired Sites Start with plots over time for each site: 89 / 118
90 Comparing Water Quality Readings - 3+ Paired Sites 90 / 118
91 Comparing Water Quality Readings - 3+ Paired Sites 91 / 118
92 Comparing Water Quality Readings - 3+ Paired Sites Here, the column effect = the effect of SITES P-value = 0.02, so there is some evidence of a consistent difference in the MEANS among sites This does NOT indicate which sites could have the same or different means. Need to follow up with a Tukey Multiple Comparison Procedure to identify which pairs of sites could have different means. 92 / 118
93 Comparing Water Quality Readings - 3+ Paired Sites Look at estimates of the MARGINAL means: The critical range is CR = Q_a × √(MSE / #blocks), where Q_a is the value from the Studentized range with df_1 = #sites and df_2 = df_MSE. In this case we look up (5, 52) df (see previous slide) at duke.edu/courses/spring98/sta110c/qtable.html and find Q_a = 4.20 and √(MSE/#blocks) = 0.65, so CR = 4.20 × 0.65 = 2.73. Then two means could be different if |Ȳ_1 − Ȳ_2| > CR = 2.73. 93 / 118
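The hand computation of the critical range can be sketched as follows; the site means, MSE, block count, and Q value below are made up for illustration, not the French Creek output:

```python
import math

# Illustrative inputs (NOT the French Creek analysis): Q from a
# Studentized-range table, MSE and block count from an ANOVA table
site_means = {"Barclay": 2.6, "Coombs": 1.4, "Grafton": 1.9}
mse, n_blocks, q_crit = 0.65, 14, 4.20

cr = q_crit * math.sqrt(mse / n_blocks)   # CR = Q_a * sqrt(MSE / #blocks)

# Flag every pair of sites whose means differ by more than CR
different = sorted((a, b) for a in site_means for b in site_means
                   if a < b and abs(site_means[a] - site_means[b]) > cr)
```

With these illustrative numbers, only the Barclay-Coombs pair exceeds the critical range.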
94 Comparing Water Quality Readings - 3+ Paired Sites This is VERY tedious and error-prone in EXCEL - use a proper package such as JMP/ R/ SAS etc. The output is automatic, more informative, and can handle missing data. 94 / 118
95 Comparing Water Quality Readings - 3+ Paired Sites P-value is small, so there is some evidence of a difference in means among the sites. It does not indicate where the difference may lie. 95 / 118
96 Comparing Water Quality Readings - 3+ Paired Sites This indicates which Sites could have the same mean. Think of paint-chips to understand overlap in ranges of sites that could be the same. 96 / 118
97 Comparing Water Quality Readings - 3+ Paired Sites Provides estimates of effects and confidence intervals for each pairwise difference. 97 / 118
98 Comparing Water Quality Readings - Summary Typically used for synoptic data to see if sites are comparable Exactly 2 sites: Find the difference or log(ratio) of readings from both sites. Drop any pairs with missing data. Find the mean and 95% confidence interval for the difference in MEANS See if 0 is included in the confidence interval. Find the p-value using a Paired t-test. 98 / 118
99 Comparing Water Quality Readings - Summary 3+ Sites: Analyze either the raw data or log(data). Use a Randomized Block Design analysis (Excel: Two-factor without replication). Look in the ANOVA table at the Rows or Columns effect that corresponds to SITES. Program the Tukey Multiple Comparison procedure by hand (groan)! CAUTION: EXCEL does not deal with missing values correctly - it GIVES WRONG ANSWERS. CAUTION: EXCEL is very clumsy at finding where the differences lie. You must program the Tukey procedure yourself. You will make mistakes in doing this! CAUTION: EXCEL does not provide other output to check the assumptions of the models. Avoid EXCEL - use a proper package such as JMP/ R/ SAS! 99 / 118
100 Case Study - Barclay Bridge Turbidity Case Study - French Creek - Barclay Bridge - Turbidity 100 / 118
101 Case Study - Barclay Bridge Turbidity Monthly sampling plus 5-in-30 samples 101 / 118
102 Case Study - Barclay Bridge Turbidity What are the objectives? Find the distribution of turbidity across the year? Estimate the 95 th percentile across the year? Estimate the mean and percentiles only in November? Problems: Non-random sampling across the year with some days having a higher probability of sampling than other days. More samples deliberately selected in August (when the turbidity is low) and more samples deliberately taken in November when turbidity is high. High autocorrelation in values taken close together in the 5-in-30 samples. Standard errors will be understated, i.e. you will think you are more precise than you are. Not clear how to interpret means and percentiles for part of a year. The 95 th percentile based on the 5-in-30 will NOT estimate the 95 th percentile for the year. 102 / 118
103 Case Study - Barclay Bridge Turbidity Example of bias in computing percentiles. Data were simulated to follow the previous curve with some random noise around the curve. The 95th percentile was computed based on the yearly data and on the sample-dates data. 103 / 118
104 Case Study - Barclay Bridge Turbidity Example of bias in computing means. Data were simulated to follow the previous curve with some random noise around the curve. The mean was computed based on the yearly data and on the sample-dates data. 104 / 118
105 Case Study - Barclay Bridge Turbidity Not clear what to do with this type of data without some consideration of filling in some of the missing data. 105 / 118
106 Case Study - Mercantile Readings are taken monthly, except that 5-in-30-day samples are again taken in August and November. There are many censored readings (indicated by the < character in the adjacent column). There are duplicate and split samples for QA/QC work. There is seasonality in some of the characteristics. 106 / 118
107 Case Study - Mercantile Duplicate and split-sample measurements: these are NOT independent observations, and the usual way to deal with them is to take the average of the duplicate or split-sample measurements. Outliers: the sole outliers for NO3 and Turbidity are synoptic. What happened? For some variables (e.g. NO3), weak seasonality is present, so some pooling over all months in a year is a possibility. I am still worried about the 5-in-30 readings, as serial correlation may be a problem. Some variables are highly censored while others are lightly censored. There is no easy way to deal with censoring in Excel other than to either ignore it (i.e. treat the detection limit as the data value) or use 1/2 of the detection limit as the data value. 107 / 118
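The two pre-processing steps just described — averaging duplicate/split samples into one observation, and crude substitution for censored "<" readings — can be sketched as follows. The readings below are hypothetical, not the MOE data, and the sketch is in Python rather than Excel:

```python
# Collapse duplicates to their average and substitute for censored values.
raw = [("2002-03", "0.12"), ("2002-03", "0.16"),   # duplicate pair -> one averaged value
       ("2002-04", "<0.05"),                       # censored reading
       ("2002-05", "0.31")]

def value(reading, frac=0.5):
    """Censored readings become frac * detection limit (frac=1.0 keeps the limit itself)."""
    if reading.startswith("<"):
        return frac * float(reading[1:])
    return float(reading)

# group readings by date, then average within each date
by_date = {}
for date, r in raw:
    by_date.setdefault(date, []).append(value(r))
series = {date: sum(v) / len(v) for date, v in by_date.items()}
print(series)
```

Both substitution rules (detection limit, or half of it) are known to bias summary statistics when censoring is heavy; Helsel (2005), cited earlier, describes proper censored-data methods.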
108 Case Study - Mercantile Turbidity and NO3 in Mercantile Creek: 108 / 118
109 Case Study - Mercantile Several interesting features. There appears to be seasonality, with high readings in October and April, but they are not consistent over time (e.g. look what happened in 2003). Sampling intensity is not uniform over the years, with some months missed and no apparent 5-in-30 sampling occasions. A very large value occurred in November 2002, with very small values in the months on either side of it, indicating a very volatile system. As in the French Creek dataset, it is not clear what the 95th percentile is supposed to measure. Just the peak events? Across the entire year? There appears to be an increasing trend in both the mean and the variability. Fitting a trend line to this data set is beyond the capabilities of Excel because of the problems above. 109 / 118
110 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers. This was done in R (don't use Excel - it gives WRONG results for many regressions!). More details on fitting linear models are available in my course notes. [R coefficient table: Estimate, Std. Error, t value, and Pr(>|t|) for (Intercept) and Date] Standard errors may be too small because of the autocorrelation in the residuals. Strong evidence of an increase over time. 110 / 118
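The trend line that R's lm() reports is ordinary least squares. A minimal sketch of the same calculation, on synthetic data with an assumed trend and seasonal wiggle (not the Mercantile results), written in Python rather than R:

```python
import math

def ols(x, y):
    """Least-squares line: slope = Sxy / Sxx, intercept = ybar - slope * xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# 60 monthly readings: upward trend of 0.05 per month plus a 12-month seasonal cycle
x = list(range(60))
y = [0.5 + 0.05 * xi + 0.3 * math.sin(2 * math.pi * xi / 12) for xi in x]
intercept, slope = ols(x, y)
print(f"intercept {intercept:.3f}, slope {slope:.3f} per month")
```

Over complete seasonal cycles the fitted slope recovers the underlying trend; the slide's caveat still applies — autocorrelated residuals make the reported standard errors (not computed here) too small.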
111 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers. 111 / 118
112 Case Study - Mercantile Try a linear fit to the Turbidity data after dropping the outliers. Residual plots (lower left) show an increase in variance with the mean. 112 / 118
113 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers. This was done in R (don't use Excel - it gives WRONG results for many regressions!). More details on fitting linear models are available in my course notes. [R coefficient table: Estimate, Std. Error, t value, and Pr(>|t|) for (Intercept) and Date] Standard errors may be too small because of the autocorrelation in the residuals. No evidence of an increase over time. 113 / 118
114 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers. 114 / 118
115 Case Study - Mercantile Try a linear fit to the NO3 data after dropping the outliers. Residual plots now all look OK. 115 / 118
116 Summary RRRs No easy way to compute means and percentiles for non-randomly selected data (e.g. monthly + 5-in-30 data). More flexibility when comparing across sites or trends over time. Some caution needed if readings are too close together (e.g. hourly or daily). See my webinar for Air Quality. What are you interested in? Averages - estimate means, SE, and CI. Spread - estimate means, SD, and EI. Extremes - estimate percentiles and TI (either parametric or non-parametric). CAUTION: these results are extremely sensitive to violations of the RRRs and distributional assumptions. Assess normality using Normal Quantile plots. Trends - use linear models (see my webinar for Air Quality). No amount of statistical wizardry can rescue poorly collected data! You will be severely constrained in your analyses if you only use Excel! 116 / 118
117 Summary Missing values? If missingness is MCAR, then no problems. MAR/IM are more/very difficult to deal with. Outliers? Run the analysis with the outliers in and with the outliers out. If there is no difference, then who cares. 117 / 118
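The "with and without outliers" check above is easy to automate. A sketch with made-up readings (Python rather than Excel); the cutoff used to flag the outlier is an arbitrary illustration, not a recommended rule:

```python
# Sensitivity check: does the summary statistic change much when a
# suspect observation is dropped?
readings = [1.1, 0.9, 1.3, 1.0, 1.2, 0.8, 14.0]   # one suspiciously large value

mean_all = sum(readings) / len(readings)
trimmed = [r for r in readings if r < 10]          # drop the flagged value
mean_trimmed = sum(trimmed) / len(trimmed)
print(f"mean with outlier {mean_all:.2f}, without {mean_trimmed:.2f}")
```

Here the mean nearly triples when the outlier is included, so both analyses would need to be reported; if the two answers had agreed, the outlier could safely be ignored.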
118 Summary Other possible analyses, not part of this course: Quantile regression, where you model the change in percentiles, e.g. has the 90th percentile changed over time? Testing for changes in the standard deviation over time, rather than the mean, to see if variability has changed over time. Modeling the number of events (e.g. days exceeding WQ guidelines in a month) and whether they change over time. For further help, contact Carl Schwarz (stat.sfu.ca). 118 / 118