Comments 2
For Discussion Sheet 2 and Worksheet 2

Frequency Distributions and Histograms

Discussion Sheet 2

We have studied graphs (charts) used to represent categorical data. We now want to look at a table and a kind of graph for representing numerical (as opposed to categorical) data: frequency distributions and histograms.

We sometimes want to make tables that show us the shape of a numerical data set: for which numbers there are the most cases, whether it is skewed towards small or large values or is fairly symmetrical, whether there are large gaps in the data, whether there are unusually large or small values, and so on. It is important to get a feel for the shape or structure of a data set with which you are working.

Constructing Frequency Distributions

To make a frequency distribution or relative frequency distribution, follow these steps:

1. Calculate the range of the data: Range = largest value - smallest value.

2. Determine how many (between 5 and 20) classes (intervals) of data will be needed to cover the range. The rule of thumb is to use a number of classes approximately equal to the square root of the number of values in the data set, but no more than 20 and no fewer than 5. The idea is to pick a number of classes that will show the structure of the data without picking so many classes that there will only be a few numbers in each class.

3. Divide the range by the number of classes (intervals) to determine the width of each interval. The classes will all be of equal width; that is, each consists of an interval of the range that is the same size as each of the other intervals.

4. Determine the upper and lower bound of each class (interval). You are dividing the range into a set of intervals that do not overlap and that together cover the range from smallest to largest value. Adjust the bounds of each class so that no bound is a number in the data set. For example, if the data in the set are integers, you might change the boundaries to end with .5.
We want each value in the data set to fall into one and only one of these classes.

5. Determine the number of data values that fall into each class. This is called the class frequency for that class.

6. Make a table listing the classes in one column and the class frequencies next to them in another column. This kind of table is called a frequency distribution.

7. Alternatively, we could determine what percentage of the total number of data values from the data set lie in each class by dividing the class frequency by the total and multiplying by 100. This is called the class relative frequency.

8. We could make a table, just as in Step 6 but using the class relative frequencies instead of the class frequencies. This kind of table is called a relative frequency distribution.
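Steps 1 to 3 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name suggest_classes and the example values are made up for this sketch, and rounding the square root to the nearest whole number is one reasonable reading of "approximately equal."

```python
import math

def suggest_classes(values, min_classes=5, max_classes=20):
    """Steps 1-3: return (range, number of classes, class width)."""
    # Step 1: range = largest value - smallest value
    data_range = max(values) - min(values)
    # Step 2: square-root rule of thumb, kept between 5 and 20 classes
    k = round(math.sqrt(len(values)))
    k = max(min_classes, min(max_classes, k))
    # Step 3: all classes share the same width
    width = data_range / k
    return data_range, k, width

# Made-up values, for illustration only (not from the handout)
example = [12, 15, 21, 22, 28, 30, 34, 35, 41, 47, 52, 60]
data_range, k, width = suggest_classes(example)
print(data_range, k, width)
```

In practice the computed width is usually rounded to a convenient number, and the bounds are then adjusted as described in Step 4.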
1. The data on female cholesterol for a sample of 20 are given below. Make a frequency distribution for this data set.

Sex   Cholesterol
FEM   215
FEM   257
FEM   212
FEM   238
FEM   163
FEM   171
FEM   196
FEM   187
FEM   405
FEM   232
FEM   155
FEM   309

The frequency distribution should have 5 to 20 classes. The approximate number is √20 ≈ 4.47214, so about 5 (the minimum allowed). If we extend the data from 150 (below the lowest value of 155) to 450 (above the highest value of 405), we could use 6 classes of size 50 to get from 150 to 450. Let's do that. So our class size is (450 - 150)/6 = 300/6 = 50.

Since we want non-overlapping classes (intervals) that cover the range from 150 to 450, we could have 150-200, 200-250, 250-300, 300-350, 350-400, and 400-450. Since the numbers 150, 200, 250, 300, 350, 400 and 450 do not appear among the values in our data set, we can use these as boundaries for the classes and have exactly one class into which to put every number in the data set. If one of these numbers, say 200, were among the data values, we could avoid problems by setting the boundaries as 150.5, 200.5, 250.5 and so on (since all of our data values are integers and thus none of them could be one of these boundaries).

With these classes, we then make a table and show the frequency of values in each class. If we look at the set of classes, we see this distribution of the data values:

Class     Values
150-200   167, 167, 198, 198, 163, 171, 196, 187, 155
200-250   234, 215, 212, 238, 234, 232
250-300   271, 257, 271
300-350   309
350-400   (none)
400-450   405

Of course, for a frequency distribution, we don't want the actual values in each class but the frequency (number of values) in each class. Counting them from above, we get

Class     Frequency
150-200   9
200-250   6
250-300   3
300-350   1
350-400   0
400-450   1
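The counting in this answer can be checked with a short Python sketch. The 20 values are copied from the class-by-class listing above, which shows every value in the sample. The function name frequency_distribution is made up for this sketch, and it assumes that a value equal to a boundary would go into the upper class; no such value occurs in these data, so that choice does not affect the result.

```python
# The 20 cholesterol values, as listed class by class in the worked answer
cholesterol = [167, 167, 198, 198, 163, 171, 196, 187, 155,
               234, 215, 212, 238, 234, 232,
               271, 257, 271, 309, 405]

def frequency_distribution(values, lower, upper, width):
    """Count how many values fall in each class from lo up to lo + width."""
    counts = {(lo, lo + width): 0 for lo in range(lower, upper, width)}
    for v in values:
        if not lower <= v < upper:
            raise ValueError(f"{v} is outside {lower}-{upper}")
        # Lower bound of the class containing v
        # (assumption: a boundary value would go to the upper class)
        lo = lower + ((v - lower) // width) * width
        counts[(lo, lo + width)] += 1
    return counts

freq = frequency_distribution(cholesterol, 150, 450, 50)
for (lo, hi), n in freq.items():
    print(f"{lo}-{hi}  {n}")
```

The printed counts are 9, 6, 3, 1, 0, and 1, matching the frequency distribution table.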
2. Make a relative frequency distribution for these data.

Once we have a frequency distribution, the relative frequency distribution is easy to find. We just need to convert the frequency for each class into a percentage by dividing by the total number of data values and multiplying by 100:

100 × 9/20 = 45%
100 × 6/20 = 30%
100 × 3/20 = 15%
100 × 1/20 = 5%

This then gives us the relative frequency distribution:

Class     Percent
150-200   45
200-250   30
250-300   15
300-350   5
350-400   0
400-450   5

3. How are the frequency distribution and the relative frequency distribution the same and how are they different?

Both distributions have the same classes. For the frequency distribution, the actual count or frequency of data values in each class is shown. For the relative frequency distribution, the percentage of the total number of data values in each class is shown. They show the same shape (center, spread, skew, gaps, unusually high or low values, etc.), but it may be easier to estimate the size in percentages rather than actual counts, especially when the number of data values in the data set is large.

Constructing Histograms

A histogram (so called because it was first used in picturing numbers of different types of blood cells) is essentially a bar graph (usually vertical) in which each category is a class from a frequency distribution or relative frequency distribution. Because the classes of a frequency distribution form a continuous set of intervals covering the range of data, the bars of a histogram lie next to each other and are not separated by spaces.

One axis (scale) of the graph is the set of categories from the frequency distribution. The other axis (scale) can be the number of data values from each class. If so, this is called a histogram and that axis is labeled Number. It can, alternatively, be the relative frequency of each class. If so, this is called a relative frequency histogram and the axis is labeled Percent.
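As a rough illustration of how a histogram is built from a frequency distribution, the following Python sketch prints one bar per class using the class frequencies found in question 1. It is a text-only approximation: the bars run sideways rather than vertically, and the function name text_histogram is made up for this sketch.

```python
# Classes and class frequencies from the frequency distribution in question 1
classes = ["150-200", "200-250", "250-300", "300-350", "350-400", "400-450"]
frequencies = [9, 6, 3, 1, 0, 1]

def text_histogram(labels, counts, symbol="#"):
    """Return one line per class; bar length equals the class frequency."""
    return [f"{label} |{symbol * n}" for label, n in zip(labels, counts)]

lines = text_histogram(classes, frequencies)
# Consecutive lines stand in for bars that touch, with no space between classes
print("\n".join(lines))
```

The empty 350-400 class still gets its own line, so the gap in the data stays visible.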
4. Make a histogram of the cholesterol data.

[Figure: histogram of the cholesterol data. Vertical axis ticks: 2, 4, 6, 8, 10. Horizontal axis: F Chol, with ticks at 150 and 300.]

5. Which makes it easier to see the structure or shape of the data set, the frequency distribution or the histogram?

For most people it is easier to get a sense of the shape or structure of a distribution (center, spread, skew, gaps, unusually high or low values, etc.) from a picture than from a table of numbers. This means that for most people the histogram makes it easier to see the structure or shape of the data set than the frequency distribution does.

Worksheet 2

The data on female cholesterol for a sample of 20 used in Discussion Sheet 2 are given below:

Sex   Cholesterol
FEM   215
FEM   257
FEM   212
FEM   238
FEM   163
FEM   171
FEM   196
FEM   187
FEM   405
FEM   232
FEM   155
FEM   309
1. Make a relative frequency histogram of these data (you probably will want to use the relative frequency distribution you made in Discussion Sheet 2).

[Figure: relative frequency histogram of the cholesterol data. Vertical axis: Percent, with ticks at 10, 20, 30, 40, 50, shown alongside the matching counts 2, 4, 6, 8, 10. Horizontal axis: F Chol, with ticks at 150 and 300.]

2. How are the size and shape of this relative frequency histogram the same as, and how are they different from, the histogram you made for the same data in Discussion Sheet 2?

The size and shape of the histogram and the relative frequency histogram are the same. The only difference is that the vertical axis is scaled with numbers (frequencies) for the histogram and with percents for the relative frequency histogram.

3. Which reveals more about the shape or structure of the data: the relative frequency distribution or the relative frequency histogram for the same data?

For most people it is easier to get a sense of the shape or structure of a distribution (center, spread, skew, gaps, unusually high or low values, etc.) from a picture than from a table of numbers. This means that for most people the relative frequency histogram makes it easier to see the structure or shape of the data set than the relative frequency distribution does.

4. Which reveals more about the shape or structure of the data: the histogram or the relative frequency histogram for the same data?

Since their size and shape are exactly the same, they reveal the same thing about the shape and structure of the data, so neither reveals more than the other.

5. Why might you use a relative frequency histogram instead of a simple histogram to picture a data set?

When there is a large number of values or an unusual number of values (for example, 23), percentages are more familiar than the actual counts would be. In that case, we might have a better sense of the data from a relative frequency histogram than from a simple histogram.
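The statement in questions 2 and 4 that the two histograms have the same shape can be checked numerically. The Python sketch below converts the class frequencies from Discussion Sheet 2 into percentages and confirms that every percentage is the same multiple of its count, so only the vertical scale changes. The function name relative_frequencies is made up for this sketch.

```python
# Class frequencies from the frequency distribution in Discussion Sheet 2
frequencies = [9, 6, 3, 1, 0, 1]

def relative_frequencies(counts):
    """Convert class frequencies to percentages of the total."""
    total = sum(counts)
    return [100 * n / total for n in counts]

percents = relative_frequencies(frequencies)
print(percents)  # [45.0, 30.0, 15.0, 5.0, 0.0, 5.0]

# With 20 values, each count is worth 100 / 20 = 5 percentage points,
# so every bar is rescaled by the same factor and the shape is unchanged
scale = 100 / sum(frequencies)
same_shape = all(p == scale * n for p, n in zip(percents, frequencies))
print(scale, same_shape)  # 5.0 True
```

This also explains why a count of 10 lines up with 50 percent on the vertical axis for a sample of 20.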