STAT355 - Probability & Statistics Instructor: Kofi Placid Adragni Fall 2011
Chap 1 - Overview and Descriptive Statistics 1.1 Populations, Samples, and Processes 1.2 Pictorial and Tabular Methods in Descriptive Statistics 1.3 Measures of Location 1.4 Measures of Variability
1.1 Populations, Samples,... The discipline of statistics provides methods for organizing and summarizing data and for drawing conclusions based on information contained in the data. An investigation will typically focus on a well-defined collection of objects constituting a population of interest. When the desired information is available for all objects in the population, we have what is called a census. Often, a census is impractical or infeasible. Instead, a sample (a subset of the population) is selected in some prescribed manner.
1.1 Populations, Samples,... A variable is any characteristic whose value may change from one object to another in the population. Data results from making observations either on a single variable or simultaneously on two or more variables. A univariate data set consists of observations on a single variable. We have bivariate data when observations are made on each of two variables.
1.1 Populations, Samples,... Descriptive statistics help summarize and describe important features of the data. Some are graphical in nature: histograms, boxplots, scatter plots,... Others involve the calculation of numerical summary measures, such as means, standard deviations, and correlation coefficients. Software: R, S-Plus, Minitab, SAS, SPSS, JMP, Stata,...
1.1 Populations, Samples,... Scope of Modern Statistics
  molecular biology (analysis of microarray data, SNPs,...)
  ecology (describing quantitatively how individuals in various animal and plant populations are spatially distributed)
  materials engineering (studying properties of various treatments to retard corrosion)
  marketing (developing market surveys and strategies for marketing new products)
  public health (identifying sources of diseases and ways to treat them)
  civil engineering (assessing the effects of stress on structural elements and the impacts of traffic flows on communities)
  ...
Meanwhile, statisticians continue to develop new models for describing randomness and uncertainty, and new methodology for analyzing data.
1.1 Populations, Samples,... Data Collecting Statistics deals not only with the organization and analysis of data once it has been collected but also with the development of techniques for collecting the data. Data that are not properly collected may be useless or misleading. An appropriate sampling scheme must be used, for example: simple random sample, stratified random sample,...
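As a rough illustration of the two sampling schemes named above, the sketch below draws a simple random sample and a proportionally allocated stratified random sample. The population, the stratum labels "A" and "B", the sample size, the seed, and both helper functions are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical population: 100 units, each tagged with a stratum label.
population = [(i, "A" if i < 60 else "B") for i in range(100)]


def simple_random_sample(units, n, rng):
    """Draw n units without replacement; every subset of size n is equally likely."""
    return rng.sample(units, n)


def stratified_random_sample(units, n, rng):
    """Draw a simple random sample inside each stratum, allocating n
    in proportion to stratum size (rounded to the nearest integer)."""
    strata = {}
    for unit in units:
        strata.setdefault(unit[1], []).append(unit)
    chosen = []
    for label, members in sorted(strata.items()):
        k = round(n * len(members) / len(units))
        chosen.extend(rng.sample(members, k))
    return chosen


rng = random.Random(355)  # fixed seed so the sketch is repeatable
srs = simple_random_sample(population, 10, rng)
strat = stratified_random_sample(population, 10, rng)
print(sorted(u[0] for u in srs))
print(sorted(u[0] for u in strat))
```

With a 60/40 split between the strata, the stratified draw always contains 6 units from "A" and 4 from "B", whereas the simple random sample may not.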
1.2 Pictorial and Tabular Methods in Descriptive Statistics Visual representations of data: stem-and-leaf displays, dotplots, histograms, boxplots, frequency tables, pie charts, bar graphs, scatter plots,...
1.2 Pictorial and Tabular Methods in Descriptive Statistics Stem-and-Leaf Displays
Example: data
   0.0 -0.2 -1.1 -0.6 -2.3  0.5 -0.3  1.5  1.0  1.0
   0.6 -1.1 -0.9 -1.2  0.3  1.4  0.4 -1.1  0.0  1.1
   2.0 -0.2  0.3 -0.2  0.7  0.1 -0.8  0.3  0.3  0.4
Stem-and-leaf display (stem: units digit; leaf: tenths digit; each stem is split into two rows)
  -2 | 3
  -1 |
  -1 | 2111
  -0 | 986
  -0 | 3222
   0 | 001333344
   0 | 567
   1 | 0014
   1 | 5
   2 | 0
1.2 Pictorial and Tabular Methods in Descriptive Statistics Histograms
Wednesday, Sept 7 To cover: 1.3 Measures of Location, 1.4 Measures of Variability
1.3 Measures of Location Some measures of location are: mean, median, quartiles, percentiles, and trimmed means. Data: Let $X$ be the variable of interest. $x_1, x_2, \dots, x_n$ are observations of $X$; $n$ is the number of observations, also called the sample size.
1.3 Measures of Location Data example: Caustic stress corrosion cracking of iron and steel has been studied because of failures around rivets in steel boilers and failures of steam rotors. Let $X$ be the crack length (µm) resulting from constant-load stress corrosion tests on smooth bar tensile specimens for a fixed length of time. The observations $x_1, x_2, \dots, x_{21}$, in order, are:
  16.1   9.6  24.9  20.4  12.7
  21.2  30.2  25.8  18.5  10.3
  25.3  14.0  27.1  45.0  23.3
  24.2  14.6   8.9  32.4  11.8  28.5
The sample size is $n = 21$.
Mean The mean, or arithmetic average, of a set of numbers is the most familiar and useful measure of the center. Let $x_1, x_2, \dots, x_n$ be a given set of numbers. The sample mean is denoted by $\bar{x}$. If the set is $y_1, \dots, y_n$, the sample mean is $\bar{y}$.
Definition The sample mean $\bar{x}$ of observations $x_1, x_2, \dots, x_n$ is given by
\[ \bar{x} = \frac{1}{n}\,(x_1 + x_2 + \dots + x_n) = \frac{\sum_{i=1}^{n} x_i}{n} \tag{1} \]
Example: $x_1 = 16.1$; $x_2 = 9.6$; $x_3 = 24.9$; $x_4 = 20.4$; $x_5 = 12.7$. The mean is $\bar{x} = (16.1 + 9.6 + 24.9 + 20.4 + 12.7)/5 = 83.7/5 = 16.74 \approx 16.7$.
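The arithmetic in the example can be reproduced with a few lines of Python. This is only a minimal sketch of the definition; the variable names `x`, `n`, and `x_bar` are chosen here for illustration.

```python
# First five crack lengths (µm) from the data example.
x = [16.1, 9.6, 24.9, 20.4, 12.7]

n = len(x)            # sample size
x_bar = sum(x) / n    # (x_1 + x_2 + ... + x_n) / n

print(round(x_bar, 2))  # 16.74
```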
Mean... The sample mean of $x_1, x_2, \dots, x_n$ is $\bar{x}$. The population mean is often denoted by $\mu$. Let $N$ be the total number of observations in the population. The population mean can be obtained as
\[ \mu = \frac{\text{sum of the } N \text{ population values}}{N}. \tag{2} \]
There is more to this population mean! A general definition for $\mu$ that applies to both finite and (conceptually) infinite populations will be visited later. Just as $\bar{x}$ is an interesting and important measure of sample location, $\mu$ is an interesting and important (often the most important) characteristic of a population.
Median The sample median is the middle value once the observations are ordered from smallest to largest. Notation: Denote the observations by $x_1, \dots, x_n$. The sample median is represented by $\tilde{x}$.
Definition The sample median is obtained by first ordering the $n$ observations from smallest to largest (with any repeated values included so that every sample observation appears in the ordered list). Then,
\[ \tilde{x} = \begin{cases} \text{the } \left(\frac{n+1}{2}\right)\text{th ordered value} & \text{if } n \text{ is odd,} \\ \text{the average of the } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2}+1\right)\text{th ordered values} & \text{if } n \text{ is even.} \end{cases} \tag{3} \]
Median Example: A sample of $n = 12$ recordings of Beethoven's Symphony #9 yields the following durations (min), listed in increasing order: 62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0. The sample median is the average of the $n/2 = 6$th and $(n/2 + 1) = 7$th values from the ordered list: $\tilde{x} = (66.4 + 67.4)/2 = 66.9$. Notes: If the largest observation, 79.0, were excluded from the sample, the sample median for the $n = 11$ remaining observations would be the single middle value 66.4 (the $(n + 1)/2 = 6$th ordered value, i.e. the 6th value in from either end of the ordered list). The sample mean is $\bar{x} = 68.01$, a bit more than a full minute larger than the median.
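The odd/even rule in the definition can be sketched directly in code and checked against the Beethoven durations. The function name `sample_median` is an illustrative choice, not something taken from the slides.

```python
def sample_median(values):
    """Middle ordered value if n is odd, else the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n+1)/2)th ordered value
    return (ordered[mid - 1] + ordered[mid]) / 2  # the (n/2)th and (n/2 + 1)th values


durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

print(round(sample_median(durations), 2))            # 66.9  (n = 12, even)
print(sample_median(durations[:-1]))                 # 66.4  (n = 11, 79.0 dropped)
print(round(sum(durations) / len(durations), 2))     # 68.01 (sample mean)
```

Dropping the largest duration moves the median only from 66.9 to 66.4, which illustrates how little the median reacts to an extreme value.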
Median Remarks and Notation: The population median is denoted by $\tilde{\mu}$. The sample median is very insensitive to outliers. If the median salary for a sample of engineers were $\tilde{x} = 66{,}416$, we might use this as a basis for concluding that the median salary for all engineers exceeds 60,000. The population mean $\mu$ and median $\tilde{\mu}$ will not generally be identical. When they differ, in making inferences we must first decide which of the two population characteristics is of greater interest and then proceed accordingly.
Quartiles, Percentiles, and Trimmed Means Quartiles divide the data set (sample or population) into four equal parts. The first quartile Q1 separates the lower quarter from the upper three-quarters. The second quartile Q2 is the median. The observations above the third quartile Q3 constitute the upper quarter of the data set. Example: Beethoven's Symphony #9 data: Q1 = 64.80; Q2 = $\tilde{x}$ = 66.90; Q3 = 69.30
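The slides do not say which rule was used to compute the quartiles, and several conventions exist. The quoted values are consistent with linear interpolation between ordered values, which is what Python's `statistics.quantiles` does with `method="inclusive"`; that choice is an assumption made here, and other conventions (such as taking the median of each half) can give slightly different numbers.

```python
import statistics

durations = [62.3, 62.8, 63.6, 65.2, 65.7, 66.4,
             67.4, 68.4, 68.8, 70.8, 75.7, 79.0]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(durations, n=4, method="inclusive")

print(round(q1, 2), round(q2, 2), round(q3, 2))  # 64.8 66.9 69.3
```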
Quartiles, Percentiles, and Trimmed Means A data set (sample or population) can be even more finely divided using percentiles; the 99th percentile separates the highest 1% from the bottom 99%, and so on. The mean is quite sensitive to a single outlier, whereas the median is largely unaffected by outliers. A trimmed mean is a compromise between $\bar{x}$ and $\tilde{x}$ with respect to robustness to outliers. A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and the largest 10% of the sample and then averaging what remains.
Mean, Median, Quartiles, Percentiles, and Trimmed Means Example: The production of Bidri is a traditional craft of India. Bidri wares (bowls, vessels,...) are cast from an alloy containing primarily zinc along with some copper. The following observations are on copper content (%) for a sample of Bidri artifacts:
   2.0  2.4  2.5  2.6  2.6  2.7  2.7  2.8  3.0  3.1  3.2  3.3  3.3
   3.4  3.4  3.6  3.6  3.6  3.6  3.7  4.4  4.6  4.7  4.8  5.3 10.1
Stem-and-leaf display (stem: units digit; leaf: tenths digit)
   2 | 04566778
   3 | 012334466667
   4 | 4678
   5 | 3
   6 |
   7 |
   8 |
   9 |
  10 | 1
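A display like the one above can be generated with a short script. The sketch below assumes nonnegative data recorded to one decimal place, with the integer part as the stem and the tenths digit as the leaf; the function name `stem_and_leaf` and the `|` separator are illustrative choices.

```python
copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]


def stem_and_leaf(values):
    """Return display lines: stem = integer part, leaf = tenths digit.

    Assumes nonnegative values with one decimal place.
    """
    leaves = {}
    for v in sorted(values):
        tenths = round(v * 10)            # e.g. 10.1 -> 101
        stem, leaf = divmod(tenths, 10)   # 101 -> (10, 1)
        leaves.setdefault(stem, []).append(str(leaf))
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:>2} | {''.join(leaves.get(stem, []))}"
        lines.append(row.rstrip())
    return lines


lines = stem_and_leaf(copper)
print("\n".join(lines))
```

Keeping the empty stems 6 through 9 in the output is what makes the isolated value 10.1 stand out as a possible outlier.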
Mean, Median, Quartiles, Percentiles, and Trimmed Means A prominent feature is the single outlier at the upper end. The sample mean and median are 3.65 and 3.35, respectively. A trimmed mean with a trimming percentage of 100(2/26) = 7.7% results from eliminating the two smallest and two largest observations; this gives $\bar{x}_{tr(7.7)} = 3.42$. Trimming here eliminates the large outlier and so pulls the trimmed mean toward the median. A trimmed mean with a moderate trimming percentage (between 5% and 25%) will yield a measure of center that is neither as sensitive to outliers as the mean nor as insensitive as the median. If the desired trimming percentage is $100\alpha\%$ and $n\alpha$ is not an integer, the trimmed mean must be calculated by interpolation.
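The three measures of center quoted for the Bidri data can be reproduced as follows. This is a minimal sketch that covers only the case where a whole number `k` of observations is dropped from each end (no interpolation); the helper name `trimmed_mean` is illustrative.

```python
import statistics

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and the k largest observations."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


n = len(copper)
print(round(sum(copper) / n, 2))             # 3.65  mean
print(round(statistics.median(copper), 2))   # 3.35  median
print(round(100 * 2 / n, 1))                 # 7.7   trimming percentage
print(round(trimmed_mean(copper, 2), 2))     # 3.42  trimmed mean
```

The trimmed mean lands between the median and the mean, as the discussion above describes.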
Categorical Data and Sample Proportions When the data is categorical, a frequency distribution or relative frequency distribution provides an effective tabular summary of the data. Example: If a survey of individuals who own digital cameras is undertaken to study brand preference, then each individual in the sample would identify the brand of camera that he or she owned, from which we could count the number owning Canon, Sony, Kodak, and so on. Consider sampling a dichotomous population, one that consists of only two categories (such as voted or did not vote in the last election, does or does not own a digital camera, etc.). If we let $x$ denote the number in the sample falling in category 1, then the number in category 2 is $n - x$. The relative frequency or sample proportion in category 1 is $x/n$, and the sample proportion in category 2 is $1 - x/n$.
Categorical Data and Sample Proportions Let's denote a response that falls in category 1 by a 1 and a response that falls in category 2 by a 0. A sample of size $n = 10$ might then yield the responses 1, 1, 0, 1, 1, 1, 0, 0, 1, 1. The sample mean for this numerical sample is (since the number of 1s is $x = 7$)
\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1 + 1 + 0 + 1 + 1 + 1 + 0 + 0 + 1 + 1}{10} \tag{4} \]
\[ \phantom{\bar{x}} = \frac{7}{10}. \tag{5} \]
The sample proportion of observations in category 1 is the sample mean of the sequence of 1s and 0s. Thus a sample mean can be used to summarize the results of a categorical sample. Analogous to the sample proportion $x/n$ of individuals or objects falling in a particular category, let $p$ represent the proportion of those in the entire population falling in the category.
Categorical Data and Sample Proportions As with $x/n$, $p$ is a quantity between 0 and 1, and while $x/n$ is a sample characteristic, $p$ is a characteristic of the population. The relationship between the two parallels the relationship between $\bar{x}$ and $\mu$ and between $\tilde{x}$ and $\tilde{\mu}$. In particular, we will subsequently use $x/n$ to make inferences about $p$. Example: If a sample of 100 car owners reveals that 22 have owned their car at least 5 years, then we might use 22/100 = 0.22 as a point estimate of the proportion of all owners who have owned their car at least 5 years. With $k$ categories ($k > 2$), we can use the $k$ sample proportions to answer questions about the population proportions $p_1, \dots, p_k$.
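The 0/1 coding described above makes the sample proportion a special case of the sample mean. A minimal sketch with the ten responses from the earlier example (the variable names are illustrative):

```python
# 1 = response in category 1, 0 = response in category 2.
responses = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1]

n = len(responses)
x = responses.count(1)            # number of observations in category 1
sample_mean = sum(responses) / n  # mean of the 0/1 sequence

print(x / n, sample_mean)  # 0.7 0.7
print(1 - x / n)           # proportion in category 2
```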
1.4 Measures of Variability Some measures of variability: range (max - min), interquartile range (Q3 - Q1), variance or standard deviation. Different samples or populations may have identical measures of center yet differ from one another in other important ways. Example:
Sample Variance Definition The sample variance of $x_1, x_2, \dots, x_n$, denoted by $s^2$, is given by
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \tag{6} \]
The sample standard deviation, denoted by $s$, is the (positive) square root of the variance:
\[ s = \sqrt{s^2} \tag{7} \]
Remarks: $s^2$ and $s$ are both nonnegative. The unit for $s$ is the same as the unit for each of the $x_i$'s.
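Sample Variance Computing remark:
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right] \tag{8} \]
\[ \phantom{s^2} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]. \tag{9} \]
Example: Find the variance and standard deviation of 154 142 137 133 122 126 135 135 108 120 127 134 122 (here $n = 13$).
Step 1: Form and find $\sum_{i=1}^{n} x_i^2 = 154^2 + 142^2 + \dots + 122^2 = 222581$.
Step 2: With $\sum_{i=1}^{n} x_i = 1695$, so that $\bar{x} = 1695/13 \approx 130.4$, calculate $n\bar{x}^2 = 1695^2/13 \approx 221001.9$.
Step 3: The variance is $s^2 = (222581 - 221001.9)/12 \approx 131.6$. The standard deviation is $s = \sqrt{131.6} \approx 11.5$.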
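The agreement between the definition and the computing formula can be checked numerically on the 13 observations from the example. This is a sketch for illustration only; the variable names are not from the slides.

```python
import math

x = [154, 142, 137, 133, 122, 126, 135, 135, 108, 120, 127, 134, 122]

n = len(x)
sum_x = sum(x)                   # 1695
sum_x2 = sum(v * v for v in x)   # 222581
x_bar = sum_x / n

# Definition: average squared deviation from the mean, with divisor n - 1.
s2_definition = sum((v - x_bar) ** 2 for v in x) / (n - 1)

# Computing formula: only the sum and the sum of squares are needed.
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_shortcut)

print(sum_x, sum_x2)                                   # 1695 222581
print(round(s2_definition, 1), round(s2_shortcut, 1))  # 131.6 131.6
print(round(s, 1))                                     # 11.5
```

The computing formula needs only one pass over the data, but it subtracts two large, nearly equal numbers, so on data with a large mean and a small spread the definition is usually the numerically safer choice.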
Mean and Variance Proposition Let $x_1, x_2, \dots, x_n$ be a sample and $c$ be any nonzero constant. If $y_i = x_i + c$ for $i = 1, \dots, n$, then $\bar{y} = \bar{x} + c$ and $s_y^2 = s_x^2$. If $y_i = c\,x_i$ for $i = 1, \dots, n$, then $\bar{y} = c\,\bar{x}$, $s_y^2 = c^2 s_x^2$, and $s_y = |c|\, s_x$, where $s_x^2$ and $s_y^2$ are the sample variances of the $x$'s and the $y$'s, respectively.
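A quick numerical check of the proposition, not a proof, can be run on any sample. The sketch below reuses five crack lengths from the earlier data example and a hypothetical constant `c = -2.0`; a negative value is chosen so that the absolute value in the standard deviation rule actually matters.

```python
import math
import statistics

x = [16.1, 9.6, 24.9, 20.4, 12.7]
c = -2.0  # hypothetical nonzero constant

shifted = [v + c for v in x]   # y_i = x_i + c
scaled = [c * v for v in x]    # y_i = c * x_i

mean = statistics.mean
var = statistics.variance      # sample variance (divisor n - 1)
sd = statistics.stdev          # sample standard deviation

shift_ok = (math.isclose(mean(shifted), mean(x) + c)
            and math.isclose(var(shifted), var(x)))
scale_ok = (math.isclose(mean(scaled), c * mean(x))
            and math.isclose(var(scaled), c ** 2 * var(x))
            and math.isclose(sd(scaled), abs(c) * sd(x)))

print(shift_ok, scale_ok)  # True True
```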
Five-Number Summaries and Boxplots With $x_1, x_2, \dots, x_n$, the five-number summary is given by (minimum, first quartile Q1, median, third quartile Q3, maximum), also written as (smallest $x_i$, lower fourth, median, upper fourth, largest $x_i$). A boxplot (also known as a box-and-whisker plot) is a way of graphically depicting groups of numerical data through their five-number summaries. Remark: A boxplot may also indicate which observations, if any, might be considered outliers. Using the Bidri data set, we have
Five-Number Summaries and Boxplots
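The numbers behind a boxplot of the Bidri copper data can be computed as sketched below. The quartile rule is an assumption: `statistics.quantiles` with `method="inclusive"` interpolates linearly between ordered values, while the "lower fourth / upper fourth" convention (medians of the two halves) can give slightly different values, and the slides do not state which rule their figure uses.

```python
import statistics

copper = [2.0, 2.4, 2.5, 2.6, 2.6, 2.7, 2.7, 2.8, 3.0, 3.1, 3.2, 3.3, 3.3,
          3.4, 3.4, 3.6, 3.6, 3.6, 3.6, 3.7, 4.4, 4.6, 4.7, 4.8, 5.3, 10.1]

# Quartiles by linear interpolation (an assumed convention, see above).
q1, med, q3 = statistics.quantiles(copper, n=4, method="inclusive")

five_number_summary = (min(copper), q1, med, q3, max(copper))
print([round(v, 3) for v in five_number_summary])
```

The maximum, 10.1, sits far above the third quartile, which is the kind of observation a boxplot would flag as a possible outlier.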
Comparative Boxplots Suppose we have two sets of data:
x:
   8.87  4.98 11.23 21.03 10.33 -4.03  9.70  7.67 11.64  1.73
   1.78  4.83 -1.63 13.52  4.12  5.69 13.91  7.56 17.15  8.84
  13.08  8.18 10.28  4.67 16.54 12.18  2.97  9.35 10.70 10.91
y:
   8.94  5.61  6.44 15.38  6.60  5.81  9.33 10.93  7.69  4.98
  15.16  7.87  6.04  4.74  4.81  8.68  5.12  8.93 18.89  8.33
   4.10 11.77  8.37  6.50  3.90 11.98  8.02  5.89  6.35  8.43
Exercise 78 Consider a sample $x_1, \dots, x_n$ and suppose that the values of $\bar{x}$, $s_x^2$, and $s_x$ have been calculated.
a. Let $y_i = x_i - \bar{x}$ for $i = 1, \dots, n$. How do the values $s_y^2$ and $s_y$ for the $y_i$'s compare to $s_x^2$ and $s_x$? Explain or justify.
b. Let $z_i = (x_i - \bar{x})/s_x$ for $i = 1, \dots, n$. What are $s_z^2$ and $s_z$, the variance and standard deviation for the $z_i$'s?