1 Organizing and Graphing Data


1.1 Organizing and Graphing Categorical Data

After categorical data has been sampled, it should be summarized to provide the following information:

1. Which values have been observed? (red, green, blue, brown, orange, yellow)
2. How often did each value occur?

Categorical data is usually summarized in a table giving the following information:

- categories observed
- frequency, or number of measurements for each category
- relative frequency, or proportion of measurements for each category
- percentage of measurements for each category

Definition: The relative frequency of a particular category is the fraction or proportion of the observations in the data set that fall in that category. It is calculated as

$$\text{relative frequency of a category} = \frac{\text{frequency of that category}}{\text{sum of all frequencies}}$$

$$\text{percent} = 100 \times \text{relative frequency}$$

Example: Sum of all frequencies = sample size = number of observations = n = 200

category    frequency    relative frequency    percentage
wood                                           %
tiles                                          %
linoleum                                       %
carpet                                         %
total                                          %

Such a table is called the frequency distribution table for categorical data.

Once the data is summarized in a frequency distribution table, the data can be displayed in a bar chart or pie chart. The bar chart (bar graph) effectively shows the frequencies in the different categories, whereas the pie chart shows the relationship between the parts and the whole.
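The frequency distribution table described above can be sketched in a few lines of Python. The flooring counts used here are made up for illustration (they are not the values from the example), and the function name `frequency_table` is an assumption of this sketch.

```python
from collections import Counter

def frequency_table(observations):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(observations)
    n = sum(counts.values())  # sum of all frequencies = sample size
    return {cat: (f, f / n, 100 * f / n) for cat, f in counts.items()}

# Hypothetical sample of n = 200 flooring choices (made-up counts).
sample = ["wood"] * 50 + ["tiles"] * 40 + ["linoleum"] * 30 + ["carpet"] * 80
table = frequency_table(sample)

for category, (freq, rel, pct) in table.items():
    print(f"{category:<10}{freq:>5}{rel:>8.3f}{pct:>7.1f}%")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick check on any hand-made table.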

1.1.1 Bar Graph

Definition 1 A graph made of bars whose heights represent the frequencies of the respective categories is called a bar graph.

Instead of frequencies, a bar graph might display the relative frequencies or percentages of the categories.

- For every category, the x-axis is marked with a tick.
- Each category is represented by a bar whose AREA is proportional to the corresponding frequency (relative frequency).
- Label the y-axis.

Remark: The width of each bar should be the same, so that the height is proportional to the corresponding frequency.

Example 1 Suppose the frequency distribution of the mainly used flooring products is:

            frequency    relative freq
wood
tiles
linoleum
carpet

1.1.2 Pie Charts

Pie charts provide an alternative kind of graph for categorical data:

Definition 2 A circle divided into portions that represent the relative frequencies or percentages of a population or sample belonging to different categories is called a pie chart.

The size of the slice representing a particular category is proportional to the frequency (relative frequency) of that category.

How to create a pie chart:

- Draw a circle.
- Calculate the slice size (angle), that is, the fraction of the circle for the category.
- Use a protractor to mark the angles.

$$\text{slice size} = \text{category relative frequency} \times 360^\circ$$

            frequency    relative freq    angle
wood
tiles
linoleum
carpet

M&M's example: On the M&M's webpage the following information on the distribution of colors in peanut M&M's is provided:

color      brown    yellow    red    blue    orange    green
percent    12%      15%       12%    23%     23%       15%

In order to check whether this distribution is a true description of what is in a bag, someone bought a bag with 200 peanut M&M's and wants to describe the colors of the contents. Color is a categorical variable, so a relative frequency table should be obtained.

color     count    rel. freq.    percentage
brown                            %
yellow                           %
red                              %
blue                             %
orange                           %
green                            %
Total                            %

A bar chart would look like this:

[Figure: bar chart of the M&M's colors]

For the pie chart, the angles of the slices have to be determined:

color     count    rel. freq.    angle
brown                            °
yellow                           °
red                              °
blue                             °
orange                           °
green                            °
Total                            °

This results in the following pie chart:

[Figure: pie chart of the M&M's colors]
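As an illustration of the slice-size rule (relative frequency times 360 degrees), the following sketch computes the angles for the advertised color distribution. It reads the green entry as 15%, which makes the percentages add up to 100; the function name `slice_angles` is an assumption of this sketch. The angles for the purchased bag would be computed the same way from its observed relative frequencies.

```python
def slice_angles(percentages):
    """Map each category to its slice angle in degrees: (percent / 100) * 360."""
    return {cat: pct / 100 * 360 for cat, pct in percentages.items()}

# Advertised distribution of peanut M&M's colors, in percent.
advertised = {"brown": 12, "yellow": 15, "red": 12,
              "blue": 23, "orange": 23, "green": 15}

angles = slice_angles(advertised)
for color, angle in angles.items():
    print(f"{color:<8}{angle:>7.1f} degrees")
```

The angles must add up to 360 degrees, just as the percentages add up to 100.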

1.2 Organizing and Graphing Quantitative Data

Graphs from this section display the data for a quantitative variable in such a way that the distribution of the data becomes apparent.

Stem and Leaf Plots

One way of displaying numerical data is the stem and leaf plot. Each observed number is broken into two pieces called the stem and the leaf.

How to do a stem and leaf plot:

1. Divide each measurement into two parts: the first digit(s) of the number are the stems, the last digit(s) of the number are the leaves.
2. List the stems in a column, with a vertical line to their right.
3. For each measurement, record the leaf portion in the same row as its corresponding stem.
4. Order the leaves from lowest to highest in each stem.
5. Provide a key to your stem and leaf coding so that the reader can recreate the actual measurements.

Example 2 Acceptance rates at some business schools:

16.3, 12.0, 25.1, 20.3, 31.9, 20.7, 30.1, 19.5, 36.2, 46.9, 25.8, 36.7, 33.8, 24.2, 21.5, 35.1, 37.6, 23.9, 17.0, 38.4, 31.2, 43.8, 28.9, 31.4, 48.9

[Figure: stem and leaf plot of the acceptance rates]
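The five steps above can be sketched in Python for the acceptance rates. This sketch makes its own coding choice, which may differ from the plot in the notes: each rate is truncated to a whole number, the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return {stem: sorted leaves} with stem = tens digit, leaf = ones digit."""
    rows = defaultdict(list)
    for v in values:
        whole = int(v)  # assumption of this sketch: drop the decimal part
        rows[whole // 10].append(whole % 10)
    return {stem: sorted(leaves) for stem, leaves in sorted(rows.items())}

rates = [16.3, 12.0, 25.1, 20.3, 31.9, 20.7, 30.1, 19.5, 36.2, 46.9,
         25.8, 36.7, 33.8, 24.2, 21.5, 35.1, 37.6, 23.9, 17.0, 38.4,
         31.2, 43.8, 28.9, 31.4, 48.9]

plot = stem_and_leaf(rates)
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
print("Key: 1 | 6 stands for a rate of at least 16 and less than 17")
```

With this coding there are only four stems, so most of the leaves pile up in two or three rows.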

Key for the stem and leaf plot: stem = tens, leaf = tenths

A stem and leaf plot shows: center, range, concentration, nature of the distribution (unimodal, bimodal, multimodal), unusual values, and skewness to the right or left.

Sometimes the available stem choices result in a plot that contains too few stems and a large number of leaves within each stem. In this situation you can stretch the stems by dividing each into several lines. The two common choices for dividing stems are:

- into two lines, with leaves 0 to 4 and 5 to 9
- into five lines, with leaves 0-1, 2-3, 4-5, 6-7, 8-9

Example (acceptance rates):

[Figure: stem and leaf plot of the acceptance rates with stretched stems]

You can also use stem and leaf plots to compare the distributions of two groups:

[Figure: stem and leaf plot comparing two groups]

Relative Frequency Histograms

The most common graph for describing numerical continuous data is the histogram. It visualizes the distribution of the underlying variable, that is, how many measurements are found where on the measurement scale.

What a histogram looks like:

[Figure: example histogram]

Definition: A relative frequency histogram for a quantitative data set is a bar graph in which the height of each bar shows how often (measured as a relative frequency) measurements fall in a particular interval. The classes or intervals are plotted along the horizontal axis.

The first step in creating a histogram is finding the frequency distribution of the variable of interest.

Definition 3 A frequency distribution for quantitative data lists all the classes and the number of values that belong to each class.

How to obtain a frequency distribution:

1. Decide which class intervals (preferably of equal length) to use for the frequency distribution. Each class is given through its lower boundary and its upper boundary. The class width = upper boundary - lower boundary. The number of class intervals used should be approximately the square root of the sample size, but not lower than 4 and not larger than 20. Use sensible interval boundaries: the intervals should, if possible, have the same width, and the boundaries should be round numbers (if possible whole numbers, tenths, or multiples of these).
2. Create a frequency table for the class intervals using the method of left inclusion, that is, a value equal to a boundary is counted in the interval that starts at that boundary. List the class intervals and the frequency of values falling within each interval. Also give the relative frequency for each class interval. These relative frequencies can now be displayed in a histogram.

To obtain the histogram from the frequency distribution, follow these steps:

1. Mark the boundaries of the class intervals on a horizontal axis.
2. Use the relative frequency on the vertical axis.

3. Draw a bar for each class interval, with height according to the relative frequency of the corresponding class interval.

Example 3 Histogram for acceptance rates:

1. The sample size is 25 and its square root is 5, but we will use 4 class intervals, because the range is about 10 to 50, which is easily divided into the intervals [10, 20), [20, 30), [30, 40), [40, 50).
2. Frequency distribution:

class interval    frequency    relative frequency
[10, 20)          4            0.16
[20, 30)          8            0.32
[30, 40)          10           0.40
[40, 50)          3            0.12

[Figure: histogram of the acceptance rates]

This graph uses the frequency on the vertical axis (relative frequency is the better choice). Note that the intervals have the same width!

It shows: center, range, concentration, nature of the distribution (unimodal, bimodal, multimodal), unusual values, and skewness to the right or left.
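The frequency distribution for the acceptance rates can be checked with a short Python sketch. It applies left inclusion, so a value equal to a lower boundary is counted in that class and a value equal to an upper boundary in the next one. The function name `frequency_distribution` is an assumption of this sketch.

```python
def frequency_distribution(values, boundaries):
    """Return (lower, upper, frequency, relative frequency) per class [lower, upper)."""
    n = len(values)
    rows = []
    for lower, upper in zip(boundaries[:-1], boundaries[1:]):
        freq = sum(1 for v in values if lower <= v < upper)  # left inclusion
        rows.append((lower, upper, freq, freq / n))
    return rows

rates = [16.3, 12.0, 25.1, 20.3, 31.9, 20.7, 30.1, 19.5, 36.2, 46.9,
         25.8, 36.7, 33.8, 24.2, 21.5, 35.1, 37.6, 23.9, 17.0, 38.4,
         31.2, 43.8, 28.9, 31.4, 48.9]

dist = frequency_distribution(rates, [10, 20, 30, 40, 50])
for lower, upper, freq, rel in dist:
    print(f"[{lower}, {upper})  {freq:>2}  {rel:.2f}")
```

The heights of the histogram bars are the last column of this table; using the third column instead gives a frequency histogram with the same shape, because all classes have the same width.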

Features to check for in a histogram

1. Center: where is the middle of the data?
2. Range: between which values do the data fall (here: 40 and 100)?
3. Number of peaks: unimodal (just one peak), bimodal (two peaks; this often occurs if you have observations from two groups, such as men and women), multimodal (more than two peaks).
4. Symmetry: if you can draw a vertical line so that the part to the left is a mirror image of the part to the right, then the histogram is symmetric.
5. Nonsymmetric graphs are skewed. If the upper tail of the histogram stretches out farther than the lower tail, then the histogram is positively skewed, or skewed to the right.
6. If the lower tail is longer than the upper tail, the histogram is negatively skewed, or skewed to the left.
7. Check for outliers.

2 Numerical Descriptive Measures

Methods for describing data JUST FOR NUMERICAL VARIABLES!!

2.1 Measures of Central Tendency

The mean of a set of numerical observations is the familiar arithmetic average. To write the formula for the mean in a mathematical fashion we have to introduce some notation.

Introduction of notation:

$x$ = the variable for which we have sample data
$n$ = sample size = number of observations
$x_1$ = the first sample observation
$x_2$ = the second sample observation
...
$x_n$ = the nth sample observation

For example, we might have a sample of $n = 4$ observations on $x$ = battery lifetime (hr):

$x_1 = 5.9$, $x_2 = 7.3$, $x_3 = 6.6$, $x_4 = 5.7$

The sum of $x_1, x_2, \ldots, x_n$ can be denoted by $x_1 + x_2 + \cdots + x_n$, but this is cumbersome.

The Greek letter $\Sigma$ is traditionally used in mathematics to denote summation. In particular, $\sum_{i=1}^{n} x_i$ will denote the sum of $x_1, \ldots, x_n$. The abbreviation $\Sigma x$ is used in the book.

For the example above

$$\sum_{i=1}^{4} x_i = x_1 + x_2 + x_3 + x_4 = 5.9 + 7.3 + 6.6 + 5.7 = 25.5$$

Definition: The sample mean of a numerical sample $x_1, x_2, \ldots, x_n$, denoted by $\bar{x}$, is

$$\bar{x} = \frac{\text{sum of all observations}}{\text{number of observations}} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The mean battery life is

$$\bar{x} = \frac{5.9 + 7.3 + 6.6 + 5.7}{4} = \frac{25.5}{4} = 6.375$$

Another number to describe the center of a sample is the median. The median is the value that divides the ordered sample into two sets of the same size, so that 50% of the data is less than this number (and 50% is greater than this number).

Definition: The sample median, $M$, is determined by first ordering the $n$ observations from smallest to largest. Then

$$M = \text{sample median} = \begin{cases} \text{the single middle value} & \text{if } n \text{ is odd} \\ \text{the average of the middle two values} & \text{if } n \text{ is even} \end{cases}$$

Example: Suppose you have an ordered sample of size 10 whose fifth and sixth observations are 6 and 7. The median is in this case the mean of the fifth and sixth observations, (6+7)/2 = 6.5.

In an ordered sample of size 5 whose third observation is 8, the median is the third observation, which is 8; the sample mean in that example is $\bar{x} = 7.4$.

Comparing mean and median

The mean is the balance point of the distribution. If you tried to balance a histogram on a pin, you would have to position the pin at the mean in order to succeed. The median is the point where the distribution is cut into two parts of the same area.

- In a symmetric distribution, mean and median are equal.
- In a positively skewed distribution, the mean is greater than the median.
- In a negatively skewed distribution, the mean is smaller than the median.

2.2 Measures of Dispersion for numerical data

It is not enough just to report a number that describes the center of a sample. The spread, or variability, in a sample is also an important characteristic of the sample.

Examples: [Figure: graphs of samples with different spread]

Definition: The range of a sample is the difference between the largest and the smallest value in the sample.

Range = largest value - smallest value.

Usually, the greater the range, the larger the variability. However, variability depends on more than just the distance between the two most extreme values. It is a characteristic of the whole data set, and every observation contributes to it.
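The definitions of the sample mean and the sample median above translate directly into code. The sketch below uses the four battery lifetimes; the fifth value added at the end is made up, only to show how a single large observation pulls the mean above the median.

```python
def mean(xs):
    """Sample mean: sum of all observations divided by their number."""
    return sum(xs) / len(xs)

def median(xs):
    """Sample median: middle value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

battery = [5.9, 7.3, 6.6, 5.7]
print(mean(battery), median(battery))

# Hypothetical extra observation, far above the others (positive skew).
skewed = battery + [20.0]
print(mean(skewed), median(skewed))
```

For the original four values the mean is about 6.375 and the median about 6.25. With the extra value the median only moves to 6.6 while the mean jumps to about 9.1, which matches the rule that the mean exceeds the median in a positively skewed sample.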

Sample 1: *   *   *   *   *   o   *   *   *   *   *
Sample 2: *               ****O****               *

Definition: The $n$ deviations from the sample mean are the differences

$$x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x}$$

A specific deviation is greater than zero if the value is greater than $\bar{x}$ and negative if it is less than $\bar{x}$.

The set of deviations describes the variability of the data set, but $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$. If you square every deviation before summing them up, you obtain a number that characterizes the variability in the data set.

Definition: The sample variance, denoted by $s^2$, is the sum of squared deviations from the mean divided by $n - 1$. That is

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

The sample standard deviation is the positive square root of the sample variance and is denoted by $s$.

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

For calculating the sample variance of a given sample, the following formula is easier to compute:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$$
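Both variance formulas above can be compared numerically. This is a minimal sketch using the four battery lifetimes from the earlier example (5.9, 7.3, 6.6, 5.7); the function names are assumptions of this sketch.

```python
import math

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Computing formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

battery = [5.9, 7.3, 6.6, 5.7]
xbar = sum(battery) / len(battery)

deviation_sum = sum(x - xbar for x in battery)  # zero up to rounding error
s2 = sample_variance(battery)
s2_shortcut = sample_variance_shortcut(battery)
s = math.sqrt(s2)

print(f"sum of deviations = {deviation_sum:.10f}")
print(f"s^2 = {s2:.4f} (definition), {s2_shortcut:.4f} (shortcut)")
print(f"s   = {s:.4f}")
```

The two formulas agree up to floating-point rounding. Hand calculations that round intermediate results (for example rounding the variance to 0.53 before taking the square root) can differ from the computer result in the third decimal.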

Example: Calculate the standard deviation of the 4 battery lives.

i      $x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$    $x_i^2$
1      5.9      -0.475             0.226                  34.81
2      7.3      0.925              0.856                  53.29
3      6.6      0.225              0.051                  43.56
4      5.7      -0.675             0.456                  32.49
Σ      25.5     0                  1.589                  164.15

The sample variance is $s^2 = 1.589/3 = 0.530$ and the sample standard deviation is $s = \sqrt{0.530} = 0.728$.

Using the other formula, first calculate

$$\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} = 164.15 - \frac{25.5^2}{4} = 164.15 - 162.5625 \approx 1.59$$

so that $s^2 = 1.59/3 = 0.53$. With this we get $s = \sqrt{0.53} = 0.728$.

Measures of position

The concept of the median can be generalized by asking for the number such that k% (instead of 50%) of the data falls below it.

Definition: For any particular number $k$ between 0 and 100, the $k$th percentile is a value such that $k$ percent of the observations in the data set fall at or below that value.

With this definition, the median is the 50th percentile: 50% of the data fall below the median.

An alternative measure of variability is the interquartile range. Like the mean, the standard deviation is greatly affected by outliers. Like the median, the interquartile range is resistant to outliers. It is based on quantities called quartiles.

Definition:

- The lower quartile $Q_1$ is the 25th percentile; 25% of the data fall below it.
- The median $Q_2$ is the 50th percentile; 50% of the data fall below it.
- The upper quartile $Q_3$ is the 75th percentile; 75% of the data fall below it (and 25% above).

The middle 50% of the measurements fall between the lower and upper quartile.

The quartiles of a sample are obtained as follows:

1. Divide the $n$ ordered observations into a lower and an upper half; if $n$ is odd, the median is excluded from both halves.
2. The lower quartile $Q_1$ is the median of the lower half.
3. The upper quartile $Q_3$ is the median of the upper half.

Example:

[Figure: ordered sample with Q1, median, and Q3 marked]

Definition: The interquartile range (IQR) is given by

$$\text{IQR} = \text{upper quartile} - \text{lower quartile} = Q_3 - Q_1$$

The IQR in the example is IQR = 8 - 5 = 3. The middle 50% of the data points in this sample are captured in an interval not longer than 3.

Summarizing a data set with a Boxplot

The boxplot is a powerful graphical tool for summarizing data. It shows the center, the spread, and the symmetry or skewness at the same time. It is based on the median, the IQR, and the minimum and maximum of the observations.

Construction of a boxplot

1. Draw a horizontal or vertical measurement scale.
2. Draw a rectangular box whose lower edge is at the lower quartile and whose upper edge is at the upper quartile.
3. Draw a line segment inside the box at the location of the median.
4. Add line segments from each end of the box to the smallest and largest observation in the data set.

Example: Sample of pulse after exercise of size 92.

mean = 80, median = 76.0, min = 50, max = 140, lower quartile $q_l$ = 68, upper quartile $q_u$ = 87

[Figure: boxplot of pulse after exercise]
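The quartile rule used in these notes (median of the lower half and median of the upper half, with the overall median left out when n is odd) can be sketched as follows. The sample is made up for illustration, chosen so that its quartiles are 5 and 8; statistical software often uses slightly different percentile rules, so library results may not match this method exactly.

```python
def median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(xs):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # lower half
    upper = ordered[half + n % 2:]    # upper half; skips the median if n is odd
    return median(lower), median(ordered), median(upper)

# Hypothetical ordered sample of size 10.
sample = [2, 4, 5, 5, 6, 7, 8, 8, 9, 10]
q1, med, q3 = quartiles(sample)
iqr = q3 - q1

# The five numbers a basic boxplot is drawn from.
five_numbers = (min(sample), q1, med, q3, max(sample))
print(five_numbers, "IQR =", iqr)
```

The tuple `five_numbers` holds exactly the values needed for construction steps 2 to 4 of the boxplot: the box edges, the median line, and the ends of the two line segments.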

A boxplot can be supplied with even more information. Sometimes a star * is added for the mean. This helps to give a visual comparison between mean and median. In addition, outliers may be identified in the boxplot. In order to do this, we first have to define what an outlier is.

Definition: An observation is called an outlier if it is more than 1.5 IQR away from the closest quartile.

In order to determine whether there is an outlier present in the data set, calculate

- upper fence = upper quartile + 1.5 IQR; every measurement above the upper fence is an upper outlier
- lower fence = lower quartile - 1.5 IQR; every measurement below the lower fence is a lower outlier

Example: The IQR in the example is 87 - 68 = 19, and 1.5 × 19 = 28.5.

upper fence = 87 + 1.5 × 19 = 115.5. The maximum equals 140, so the data contains at least one upper outlier.

lower fence = 68 - 1.5 × 19 = 39.5. The minimum equals 50, so there is no lower outlier present.

Outliers may be marked by a circle or a star in a boxplot. In this case the whiskers only extend to the smallest and largest non-outliers.

One can create comparative boxplots by drawing several boxes in one graph. This is a good tool for comparing continuous variables in different categories.

Example: Resting pulse and pulse after exercise boxplots in one graph.

[Figure: comparative boxplots of resting pulse and pulse after exercise]
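The fence rule can be checked with a small Python sketch using the quartiles from the pulse example (lower quartile 68, upper quartile 87). The function names `fences` and `is_outlier` are assumptions of this sketch.

```python
def fences(q1, q3):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(x, q1, q3):
    """True if x lies below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return x < lower or x > upper

lower_fence, upper_fence = fences(68, 87)
print(lower_fence, upper_fence)   # 39.5 115.5
print(is_outlier(140, 68, 87))    # True: the maximum is an upper outlier
print(is_outlier(50, 68, 87))     # False: the minimum is not a lower outlier
```

Only the minimum and the maximum are available from the summary here, so the sketch can confirm that at least one upper outlier exists but not how many there are; that would require the full data set.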


3 A four step process

1. STATE: What is the practical question in the context of the discipline?
2. PLAN: What statistical tool(s) have to be employed to find an answer?
3. SOLVE: Make the necessary graphs and calculations.
4. CONCLUDE: Give the answer to the question STATEd above in the context of the discipline.

Example (Logging in the Rainforest, pg. 57):

1. State: Does logging the tropical rain forest result in its destruction? To answer this question we have data on the number of trees per acre on plots that had never been logged (Group 1), plots that had been logged 1 year earlier (Group 2), and plots that had been logged 8 years earlier (Group 3).
2. Plan: Do side-by-side boxplots and descriptive statistics for the data from the 3 groups.
3. Solve:

GROUP    N    Mean    Median    StDev    Minimum    Maximum    Q1    Q3

4. Conclude: The numerical summary as well as the boxplot suggest that logging results on average in a smaller number of trees per acre, whereas the standard deviation seems to be almost unchanged.