STA 248 Winter 2005 Assignment 1

Due: Thursday, January 27 at beginning of lecture. (Late assignments will be subject to a deduction of 10% of the total marks for the assignment for each day late.)

Please hand in your R code when used. On future assignments I won't be typing out the textbook problems. Let me know if you have any difficulty getting a copy of the text.

Problems to be handed in for marking:
Chapter 6: 12, 17, 25
Chapter 7: 9, 23, 25, 30, 59
Additional problems: 3

Problems from the textbook:

Chapter 6:

8. Acute exposure to cadmium produces respiratory distress and kidney and liver damage, and may even result in death. For this reason, the level of airborne cadmium dust and cadmium oxide fume in the air is monitored. This level is measured in milligrams of cadmium per cubic meter of air. A sample of 35 readings yields the given data (available on the web).
(a) Construct a stem-and-leaf diagram for these data. Use the numbers 02, 03, 04, 05, 06, and 07 as stems. (Do by hand.)
(b) Would you be surprised to hear someone claim that the random variable $X$, the cadmium level in the air, is normally distributed? Explain.
(d) Use R to construct a relative frequency histogram for these data. Does the histogram exhibit the bell-shape characteristic of a normal density?
(e) Construct a relative cumulative frequency ogive for these data. Use the ogive to approximate that point above which 50% of the readings should fall.

12. (Percentiles.) Let $X$ be a random variable. The point $p_{k/100}$ such that $P[X < p_{k/100}] \le k/100$ and $P[X \le p_{k/100}] \ge k/100$ is called the $k$th percentile for $X$. For example, let $X$ be binomial with $n = 20$ and $p = .5$. The 25th percentile for $X$ is the point $p_{25/100} = 8$ since $P[X < 8] = .1316 \le .25$ and $P[X \le 8] = .2517 \ge .25$.
(a) Let $X$ be binomial with $n = 20$ and $p = .5$. Find the 60th percentile for $X$.
(b) Let $X$ be Poisson with $\lambda = 10$. Find the 30th percentile for $X$.
(d) Let $X$ be exponentially distributed with $\beta = 1$. Show that the 20th percentile for $X$ is $-\ln .80$.
Hint: Find the point $p$ such that $\int_0^p e^{-x}\,dx = .20$.
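The percentile definition in Problem 12 can be checked numerically for the worked binomial example given in the problem statement. The following is a minimal R sketch, not part of the textbook; the helper name `is_percentile` is hypothetical, and it assumes an integer-valued random variable so that $P[X < p]$ equals the CDF evaluated at $p - 1$.

```r
# Hypothetical helper (not from the textbook): test whether an integer p
# satisfies both inequalities in the definition of the kth percentile.
# Assumes X is integer-valued, so P[X < p] = cdf(p - 1).
is_percentile <- function(p, k, cdf) {
  below <- cdf(p - 1)        # P[X < p]
  at_or_below <- cdf(p)      # P[X <= p]
  below <= k / 100 && at_or_below >= k / 100
}

# CDF of the binomial distribution with n = 20 and p = .5
binom_cdf <- function(x) pbinom(x, size = 20, prob = 0.5)

# Worked example from the problem statement: the 25th percentile is 8.
round(binom_cdf(7), 4)   # P[X < 8]
round(binom_cdf(8), 4)   # P[X <= 8]
is_percentile(8, 25, binom_cdf)

# Search the whole support for points that meet the definition.
candidates <- 0:20
candidates[sapply(candidates, is_percentile, k = 25, cdf = binom_cdf)]
```

The same helper could be pointed at another integer-valued CDF (for instance one built from `ppois`) by passing a different `cdf` function and a suitable range of candidates.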

17. Consider the two given data sets (available on web).
(a) Find the sample mean and sample median for each data set.
(b) Find the sample range for each data set.
(c) Find the sample variance and sample standard deviation for each data set.
(d) Would you be surprised to hear someone claim that these data were drawn from the same population? Explain. Hint: Consider the shape of the distribution as well as the observed values of the sample statistics.

20. Use the data of Exercise 8 to approximate the mean, variance, and standard deviation of the random variable $X$, the level of airborne cadmium dust and cadmium oxide fumes. Assume that these approximations are fairly accurate. Between what two values would you expect approximately 95% of the readings to fall? Explain.

25. (Approximating $\sigma$ via the range.) The range can play an important role in the design of statistical studies. To obtain a prespecified degree of accuracy when estimating population parameters, an adequately sized sample must be drawn. Most formulas used to determine sample size require knowledge of $\sigma$, the population standard deviation. Often the researcher will not have an estimate of $\sigma$ available but will have an idea of the expected range of his or her data. When sampling from a normal distribution,
$$P[-2\sigma < X - \mu < 2\sigma] \doteq .95$$
If $X$ is not normally distributed, then Chebyshev's inequality can be applied to conclude that
$$P[-3\sigma < X - \mu < 3\sigma] \ge .89$$
That is, $X$ lies within 3 standard deviations of its mean with high probability. From this it can be concluded that the estimated range covers an interval of roughly $4\sigma$ for normally distributed random variables and $6\sigma$ otherwise. In the normal case an estimate of $\sigma$ can be obtained by solving the equation
$$4\sigma \doteq \text{estimated range}$$
for $\sigma$. If $X$ is not normally distributed, then
$$\sigma \doteq (\text{estimated range})/6$$
Data are given (available on the web) for the random variable $X$, the CPU time in seconds required to run a program using a statistical package.
(a) Construct a stem-and-leaf diagram for these data. Is the assumption justified that $X$ is normally distributed?
(b) Approximate $\sigma$ via the sample standard deviation $s$.
(c) Find the sample range for these data, and use it to approximate $\sigma$. Compare your result to that obtained in part (b).

27. Let $X$ be normally distributed with mean $\mu$ and variance $\sigma^2$.
(a) Verify that $q_3 = \mu + .67\sigma$ and that $q_1 = \mu - .67\sigma$.
(b) Find the interquartile range for $X$.
(c) Verify that the inner fences for $X$ are $f_1 = \mu - 2.68\sigma$ and $f_3 = \mu + 2.68\sigma$.
(d) Verify that the probability that $X$ will fall beyond the inner fences is approximately .007.
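The range-based approximations described in Problem 25 can be tried out in R before working with the real data. The sketch below uses simulated values as a stand-in; the seed, sample size, mean, and standard deviation are all assumptions chosen for illustration and are not the CPU-time data from the course website.

```r
# Simulated stand-in data: NOT the CPU-time data from the course website.
set.seed(248)
x <- rnorm(30, mean = 50, sd = 4)

s <- sd(x)               # sample standard deviation
r <- diff(range(x))      # sample range: largest value minus smallest value

sigma_normal <- r / 4    # range approximation when X is taken to be normal
sigma_cheby  <- r / 6    # range approximation based on Chebyshev's inequality

# Put the three approximations of sigma side by side for comparison.
c(s = s, range_over_4 = sigma_normal, range_over_6 = sigma_cheby)
```

Replacing `x` with the observed data gives the quantities to compare; which divisor is appropriate depends on whether the normality assumption looks reasonable.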

28. Temperature differences between the warm upper surface of the ocean and the colder deeper levels can be utilized to convert thermal energy to mechanical energy. This mechanical energy can in turn be used to produce electrical power using a vapor turbine. Let $X$ denote the difference in temperature between the surface of the water and the water at a depth of 1 kilometer. Measurements are taken at 15 randomly selected sites in the Gulf of Mexico. The measured temperatures are available on the web. Use R to do the following.
(a) Construct a double stem-and-leaf diagram for these data.
(b) Find the sample mean, sample median, and sample standard deviation for these data.
(c) Note that the observation with value 10.1 is very different from the others. It is a potential outlier. Construct a boxplot for these data to verify that the value 10.1 does appear to be an outlier.
(d) To see the effect of this outlier, drop it from the data set and calculate the sample mean, median, and standard deviation for the remaining 14 observations. Which measure is least affected by the presence of the outlier?

36. It is known that power surges or line spikes can damage sensitive electronic equipment. A study of these surges is conducted. The purpose of the study is to ascertain whether or not there are differences in the frequency of these surges among the seven days of the week. Data for the study are found on the website. Variables are observation number; day, with m = Monday, t = Tuesday, w = Wednesday, th = Thursday, f = Friday, s = Saturday, and sn = Sunday; and number of spikes per day. Use R to do the following.
(a) Obtain descriptive statistics on the number of spikes per day for each day of the week. Discuss any differences among days that appear to exist.
(b) Construct boxplots for each day, and use the boxplots for a visual comparison of the days.

Chapter 7:

1. Let $X_1, X_2, \ldots, X_{20}$ be a random sample from a distribution with mean 8 and variance 5. Find the mean and variance of $\bar{X}$.

5. Let $X_1, X_2, X_3, X_4, X_5$ be a random sample from a binomial distribution with $n = 10$ and $p$ unknown.
(a) Show that $\bar{X}/10$ is an unbiased estimator for $p$.
(b) Estimate $p$ based on these data: 3, 4, 4, 5, 6.

9. (Weighted means.) Assume that one has $k$ independent random samples of sizes $n_1, n_2, \ldots, n_k$ from the same distribution. These samples generate $k$ unbiased estimators for the mean, namely, $\bar{X}_1, \bar{X}_2, \ldots, \bar{X}_k$.
(a) Show that the arithmetic average of these estimators, $(\bar{X}_1 + \bar{X}_2 + \cdots + \bar{X}_k)/k$, is also unbiased for $\mu$.
(b) Certain mineral elements required by plants are classed as macronutrients. Macronutrients are measured in terms of their percentage of the dry weight of the plant. Proportions of each element vary in different species and in the same species grown under differing conditions. One macronutrient is sulfur. In a study of winter cress, a member of the mustard family, these data, based on three independent random samples, are obtained:
$\bar{x}_1 = .8$, $\bar{x}_2 = .95$, $\bar{x}_3 = .7$
$n_1 = 9$, $n_2 = 3$, $n_3 = 200$
Use the result of part (a) to obtain an unbiased estimate for $\mu$, the mean proportion of sulfur by dry weight in winter cress. By averaging the three values .8, .95, and .7 to obtain the estimate for $\mu$, each sample is being given equal importance or weight. Does this seem reasonable in this problem? Explain.
(c) To take sample sizes into account, a weighted mean is used. This estimator, $\hat{\mu}_W$, is given by
$$\hat{\mu}_W = \frac{n_1 \bar{X}_1 + \cdots + n_k \bar{X}_k}{n_1 + \cdots + n_k}$$
Show that $\hat{\mu}_W$ is an unbiased estimator for $\mu$.
(d) Use the data of part (b) to find the weighted estimate for the mean proportion of sulfur by dry weight in winter cress. Compare your answer to the estimate found in part (b).

16. Let $X_1, X_2, \ldots, X_m$ be a random sample of size $m$ from a binomial distribution with parameters $n$, assumed to be known, and $p$. Show that the method of moments estimator for $p$ is $\hat{p} = \bar{X}/n$.

17. Let $X_1, X_2, \ldots, X_n$ be a random sample from a Poisson distribution with parameter $\lambda$. Find the method of moments estimate for $\lambda$.

23. Find the method of moments estimator for the parameter $p$ of a geometric distribution.

25. Using the method of moments estimator for $p$ found in Exercise 23, find an estimator for $\sigma^2$ for the geometric distribution. (You don't have to do the rest of this question that is in the text.)

27. Carbon dioxide is an odorless, colorless gas that constitutes about .035% by volume of the atmosphere. It affects the heat balance by acting as a one-way screen. It lets in the sun's heat to warm the oceans and the land but blocks some of the infrared heat that is radiated from the earth. This reflected heat is absorbed into the lower atmosphere, producing a greenhouse effect which causes the earth's surface to become warmer than it would be otherwise. Systematic measurements of CO$_2$ began in 1957 with Charles D. Keeling monitoring at Mauna Loa in Hawaii.
(a) Given the data (available on the web) that are CO$_2$ readings in ppm, construct a stem-and-leaf plot (by hand) for these data using 31, 32, 32, 33, 33, 34, 34, 35 as stems. Graph leaves 0-4 on the first of each repeated stem and leaves 5-9 on the other. Is it reasonable to assume that the CO$_2$ level in the atmosphere is normally distributed? Explain.
(b) Estimate $\mu$ and $\sigma^2$ using the method of moments estimators.
(c) Find an unbiased estimate for $\sigma^2$.

29. Based on the data of Exercise 27, what are the maximum likelihood estimates for the mean and variance of the atmospheric CO$_2$ level?
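For Exercise 27(b) and (c), the method of moments estimates and the unbiased variance estimate can be computed side by side in R. The sketch below is illustrative only: it uses simulated readings as a stand-in (the seed, sample size, mean, and standard deviation are assumptions), not the Mauna Loa data from the course website.

```r
# Simulated stand-in readings: NOT the CO2 data from the course website.
set.seed(1957)
x <- rnorm(25, mean = 330, sd = 10)
n <- length(x)

mu_hat <- mean(x)                      # first sample moment
sigma2_mom <- mean(x^2) - mean(x)^2    # second sample moment minus squared first moment
sigma2_unbiased <- var(x)              # var() divides by n - 1

c(mu_hat = mu_hat, sigma2_mom = sigma2_mom, sigma2_unbiased = sigma2_unbiased)

# The two variance estimates differ only by the factor (n - 1) / n.
all.equal(sigma2_mom, (n - 1) / n * sigma2_unbiased)
```

Note that R's built-in `var()` uses the $n - 1$ divisor, so the moment-based estimate has to be computed explicitly or obtained by rescaling.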

30. Let $X_1, X_2, \ldots, X_m$ be a random sample of size $m$ from a binomial distribution with parameters $n$, assumed to be known, and $p$. Find the maximum likelihood estimator for $p$. Does it differ from the method of moments estimator found in Exercise 16?

31. Let $W$ be an exponential random variable with parameter $\beta$ unknown. Find the maximum likelihood estimator for $\beta$ based on a sample of size $n$. Does it differ from the method of moments estimator (derived in lecture)?

34. Computer terminals have a battery pack that maintains the configuration of the terminal. These packs must be replaced occasionally. Let $X$ denote the life span in years of such a battery. Assume that $X$ is exponentially distributed with unknown parameter $\beta$. Find the maximum likelihood estimate for $\beta$ based on the given data (available on the web).

35. To estimate the proportion of defective microprocessor chips being produced by a particular maker, samples of five chips are selected at 10 randomly selected times during the day. These chips are inspected, and $X$, the number of defective chips in each batch of size 5, is recorded. Assume that $X$ is binomially distributed with $n = 5$ and $p$ unknown. Use the data given (available on the web) to find the maximum likelihood estimate for $p$.

54. Let $X$ denote the unit price of a 3.5-inch floppy diskette. Observations are obtained from a random sample of 10 suppliers. (Data are available on web.)
(a) Find an unbiased estimate for the mean price of these diskettes.
(b) Find an unbiased estimate for the variance in the price of these diskettes.
(c) Find the sample standard deviation. Is this an unbiased estimate for $\sigma$?
(d) Assume that $X$ is normally distributed. Find the maximum likelihood estimate for $\sigma^2$. Does this agree with your answer to (b)?

59. Consider the random variable $X$ with density given by
$$f(x) = (1/\theta^2)\, x e^{-x/\theta}, \quad x > 0$$
(b) Show that $E(X) = 2\theta$.
(c) Find the method of moments estimator for $\theta$.
(d) Find the maximum likelihood estimator for $\theta$ based on a random sample of size $n$. Does this estimator differ from that found in part (c)?
(e) Estimate $\theta$ based on these data: 3 5 2 3 4 1 4 3 3 3
(f) Are the estimators found in parts (c) and (d) unbiased estimators for $\theta$?

Additional problems:

1. Which of the following statistics can be made arbitrarily large by making one number out of a batch of 100 numbers arbitrarily large: the mean, the median, the 10% trimmed mean, the standard deviation, the interquartile range?

2. Suppose $X_1, \ldots, X_n$ are $n$ identically distributed random variables with $E(X_i) = \mu$, $i = 1, \ldots, n$. Show that $(\bar{X})^2$ is not an unbiased estimate of $\mu^2$.
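One way to explore Additional problem 1 empirically is to compute each statistic on a batch of 100 numbers, then again after replacing one value with a very large one. The sketch below is only an experiment on a hypothetical batch (the values `1:100`, the replacement value `1e6`, and the helper name `summarize` are all assumptions) and does not replace an argument for why each statistic behaves as it does.

```r
# Hypothetical batch of 100 numbers; any batch would do for this experiment.
x <- 1:100
y <- x
y[100] <- 1e6   # make a single observation very large

# Hypothetical helper: the five statistics named in the problem.
summarize <- function(v) {
  c(mean    = mean(v),
    median  = median(v),
    trimmed = mean(v, trim = 0.10),  # drops the lowest 10% and highest 10%
    sd      = sd(v),
    iqr     = IQR(v))
}

# Compare the statistics before and after the change.
rbind(original = summarize(x), contaminated = summarize(y))
```

Rerunning with an even larger replacement value shows which rows of the comparison keep growing and which stay fixed.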

3. What general features are evident in a boxplot of data from a normal distribution? from a skewed distribution? from a distribution that is symmetric and bell-shaped like the normal distribution, but has less probability in the tails (the extreme values)? from a distribution that is symmetric and bell-shaped like the normal distribution, but has more probability in the tails (the extreme values)?

4. In data compression of text, a probability model is used where the probability of the next letter is heavily influenced by the preceding letters. In a first-order Markov model, the probability of the next letter depends only on the one letter immediately preceding it. Suppose we are interested in a model for the compression of a binary string. I'll label the values b for black and w for white. For a first-order Markov model we need the following probabilities for the value of a letter given the value preceding it:
$$P(w \mid w) = p_w, \quad P(b \mid w) = 1 - p_w, \quad P(b \mid b) = p_b, \quad P(w \mid b) = 1 - p_b$$
Suppose $X_i$ is the random variable that is 1 if the $i$th letter is w and 0 if the $i$th letter is b. Then given that the $(i-1)$th letter is w (say), the probability function of $X_i$ is
$$P(X_i = x \mid X_{i-1} = 1) = p_w^x (1 - p_w)^{1-x}.$$
Suppose the string bbbbwwwbbbbbwwbbbbbbwwwwb is observed. Use maximum likelihood to estimate the parameters $p_w$ and $p_b$.
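For the Markov model in Additional problem 4, the likelihood depends on the string only through the counts of each type of transition, so tabulating transitions is a natural first step. The sketch below shows one way to do that in R on a short hypothetical string (`"wwwbbbbw"`, not the string in the problem). It conditions on the first letter and uses the fact that each conditional distribution is Bernoulli, so the estimate is the proportion of "stay" transitions out of each state.

```r
# Hypothetical short string for illustration: NOT the string from the problem.
s <- "wwwbbbbw"
chars <- strsplit(s, "")[[1]]

prev <- head(chars, -1)   # letter in position i - 1
nxt  <- tail(chars, -1)   # letter in position i

# Transition counts: rows are the preceding letter, columns the next letter.
states <- c("b", "w")
trans <- table(prev = factor(prev, levels = states),
               nxt  = factor(nxt,  levels = states))
trans

# Proportion of transitions that stay in the same state.
p_w_hat <- trans["w", "w"] / sum(trans["w", ])   # estimate of P(w | w)
p_b_hat <- trans["b", "b"] / sum(trans["b", ])   # estimate of P(b | b)
c(p_w_hat = p_w_hat, p_b_hat = p_b_hat)
```

Fixing the factor levels keeps the table 2 by 2 even if one type of transition never occurs in the observed string.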