Unit 2: SUMMARIZING DATA

Topic 6: Two-Way Tables

In Topic 2, we learned how to construct a bar chart to represent data for a single categorical variable. Often, we are interested in relationships among two or more categorical variables (sex, race, occupation, etc.). Information on two categorical variables is best represented in a simple two-way table of counts, as shown in the example to follow. All analysis done on categorical variables involves nothing more than computing proportions and summing values.

Example: Data on educational attainment of Americans of different ages were collected by the Census Bureau in 1987. The two-way table below contains counts of the number of people who fell into each education/age category listed on the sides of the table. Totals for each level of educational attainment and age group are given in the last column and row respectively. So, for example, of the 151,616 total people surveyed, 13,183 were over the age of 65 and did not complete high school.

[Table: Education (rows: Didn't complete high school; Completed high school; College: 1-3 years; College: 4 or more years; Total) by Age Group (columns, plus Total). The cell counts are missing from this copy.]

Note: This two-way table may also be called a 4x5 table to indicate the number of levels of each of the 2 variables. The 4 represents the number of education categories (row variable) and the 5 represents the number of age group categories (column variable). Typically, if an explanatory variable and a response variable can be clearly identified, the explanatory variable will be the column variable, and the response variable will be the row variable.

Virtually all aspects of the relationship between the variables Educational Attainment and Age Group can be found using proportions. To see this, consider the following questions.

What proportion of those surveyed did not complete high school? Of the 151,616 people surveyed, 36,114 did not complete high school, giving a proportion of 36114/151616 = .2382 or 23.82% who did not complete high school.

What proportion of those surveyed completed high school only? 58941/151616 = .3887 or 38.87% completed high school only.

It is simple to verify that 17.01% and 20.30% completed 1-3 years of college and 4 or more years of college respectively. These four percentages (23.82%, 38.87%, 17.01%, 20.30%) taken collectively make up what is known as the marginal distribution of the variable Educational Attainment. Using the word marginal means we are only considering one of the variables (in this case, educational attainment), not the relationship between the two. So the marginal distribution gives the percentages of cases falling into each category of a single categorical variable.

Marginal distributions of categorical variables are best represented using bar graphs, as was done back in Topic 1. The marginal distribution for educational attainment is shown in the bar graph.

[Bar graph: marginal distribution of educational attainment. Vertical axis: 0% to 40%. Bars: Didn't Complete, Completed, College: 1-3 Years, College: 4 or More Years.]

Relationships Between Categorical Variables: Consider now the following question: What proportion of those surveyed between the ages of 35 and 44 completed 4 years of college? The italicized part of this question (between the ages of 35 and 44) is meant to draw your attention to the fact that the proportion sought is only for those between the ages of 35 and 44. Since there are 34,682 people in the 35-44 age group and 9332 of them completed 4 years of college, 9332/34682 = .2691 or 26.91% of the 35-44 year-olds completed 4 years of college.

Suppose we wanted to find the proportion of those between 35 and 44 at each of the 4 educational attainment levels. These proportions are: .1396, .3806, .2107, and .2691 respectively. Because we are finding these proportions under the condition that the respondents are between 35 and 44 years of age, these values make up the conditional distribution of the educational attainment variable for the age group 35-44.

Probably the best way to visually represent these conditional distributions between categorical variables is through what is known as a segmented bar graph. A segmented bar graph for these data is shown below. Does there appear to be a relationship between educational attainment and age? Describe this relationship.

[Segmented bar graph: one bar per Age Group (up to >65), each split by Education Level (Didn't Complete, Completed, College: 1-3 Years, College: 4 or More Years). Vertical axis: 0% to 100%.]

R Note: To create simple bar graphs in R, see the relevant pages of the R-Commander manual. To create a segmented bar graph in R, see the relevant pages of the R-Commander manual.

Note: These same data were compiled from a sample taken as part of the 2010 census and appear below along with a segmented bar graph. What differences do you see?
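
The marginal and conditional proportions above differ only in which total goes in the denominator. The short Python sketch below is an illustration only: Python and the helper name `proportion` are assumptions of this example, not part of the course materials (which use R). It uses only counts quoted in the text, and its results agree with the quoted values to within rounding.

```python
def proportion(count, total):
    """Return count / total after checking that the inputs make sense."""
    if total <= 0 or not 0 <= count <= total:
        raise ValueError("need 0 <= count <= total and total > 0")
    return count / total

TOTAL_SURVEYED = 151_616       # grand total of the table
NO_HIGH_SCHOOL = 36_114        # row total: did not complete high school
HIGH_SCHOOL_ONLY = 58_941      # row total: completed high school only
AGE_35_44_TOTAL = 34_682       # column total: age group 35-44
AGE_35_44_COLLEGE_4 = 9_332    # cell: 35-44 with 4 or more years of college

# Marginal proportions: the denominator is the grand total.
marginal_no_hs = proportion(NO_HIGH_SCHOOL, TOTAL_SURVEYED)
marginal_hs_only = proportion(HIGH_SCHOOL_ONLY, TOTAL_SURVEYED)

# Conditional proportion: the denominator is the 35-44 column total only.
conditional_college_4 = proportion(AGE_35_44_COLLEGE_4, AGE_35_44_TOTAL)

print(f"{marginal_no_hs:.4f} {marginal_hs_only:.4f} {conditional_college_4:.4f}")
```

The only design choice here is to keep the denominator explicit, since choosing the wrong total is the usual source of error.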

2010 Data

[Table: Education (rows: Didn't complete high school; Completed high school; College: 1-3 years; College: 4 or more years; Total) by Age Group (columns, plus Total). The cell counts are missing from this copy.]

Has the relationship between educational attainment and age changed over the past 25 years?

[Segmented bar graph, 2010 data: one bar per Age Group (up to >65), each split by Education Level (Didn't Complete, Completed, College: 1-3 Years, College: 4 or More Years). Vertical axis: 0% to 100%.]

Independence: In the age vs. education level example, it was clear that the distribution of education levels depended on the age group. For example, the younger age groups tended to have attained higher levels of education than the older age groups for the 1987 data. Whenever the conditional distribution of one variable is identical for every category of the other variable, the two variables are said to be independent. To illustrate this idea, consider the following data describing the relationship between students' residency status and gender for 100 students. Intuitively, would you expect there to be a relationship between gender and residency status, or would you expect these variables to be independent?

[Table: counts of the 100 students by residency status (rows: In-State, Out-of-State) and gender (columns: Male, Female). The cell counts are missing from this copy.]

Note that regardless of whether a student is male or female, the probability of having in-state residency is 3/4 and the probability of not having residency is 1/4. By the same token, regardless of a student's residency, the probability of being male is 4/10 and the probability of being female is 6/10. Since the distribution of one variable does not depend on the level of the other variable, the variables residency status and gender are independent.

Relative Risk: Sometimes it is informative to look at the ratio of two proportions to gain information about the relative likelihood of occurrence of two events. Such a ratio of proportions is known as the relative risk between the two groups. For example, if the proportion of men between the ages of 50 and 59 with coronary heart disease (CHD) is 0.5 and the proportion of men between the ages of 30 and 39 with CHD is 0.2, the relative risk of a man having CHD between the two age groups is 0.5/0.2 = 2.5. This says that a randomly selected man in his 50s is 2.5 times as likely to have CHD as a randomly selected man in his 30s. In the age vs. education level example, what is the relative risk among those over 65 between not completing high school and completing 4 or more years of college?

Simpson's Paradox: Consider the following data collected to investigate a possible influence of race on the imposition of the death penalty for murder. Data on the race of the defendant in a murder trial and whether or not the death penalty was given appear in the table below.
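
The independence check and the relative risk calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of the original notes: the cell counts are reconstructed (they are the only counts for 100 students consistent with the stated 3/4 and 4/10 proportions), and the function names are hypothetical.

```python
from fractions import Fraction

# ASSUMPTION: reconstructed counts, not copied from the original table.
# They follow from 100 students, 3/4 in-state within each gender,
# and 4/10 male within each residency status.
counts = {
    "In-State":     {"Male": 30, "Female": 45},
    "Out-of-State": {"Male": 10, "Female": 15},
}

def conditional_distribution(table, column):
    """Distribution of the row variable among cases in one column category."""
    column_total = sum(row[column] for row in table.values())
    return {name: Fraction(row[column], column_total) for name, row in table.items()}

def is_independent(table):
    """True when every column category has the same conditional distribution."""
    columns = list(next(iter(table.values())))
    distributions = [conditional_distribution(table, c) for c in columns]
    return all(d == distributions[0] for d in distributions)

def relative_risk(p_group_a, p_group_b):
    """Ratio of two proportions: group A relative to group B."""
    return p_group_a / p_group_b

print(conditional_distribution(counts, "Male"))
print(is_independent(counts))
print(relative_risk(0.5, 0.2))  # CHD example: men in their 50s vs. men in their 30s
```

Exact fractions are used so that "identical for every category" can be tested with plain equality. With real sample data the conditional distributions are rarely exactly equal, so in practice one looks for approximate agreement instead.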

                   Death Penalty?
Defendant's Race    Yes     No   Total
White                19    141     160
Black                17    149     166
Total                36    290     326

Is there a higher percentage of whites or blacks sentenced to death overall? Computing, 19 of 160 or 11.88% of the whites and 17 of 166 or 10.24% of the blacks are sentenced to death. So overall, a higher percentage of whites than blacks are sentenced to death when facing the death penalty for murder.

The Paradox: Suppose we consider a third variable, the race of the murder victim. A table incorporating this additional information is given below:

White Defendant:
                 Death Penalty
                  Yes     No
White Victim       19    132
Black Victim        0      9

Black Defendant:
                 Death Penalty
                  Yes     No
White Victim       11     52
Black Victim        6     97

Now consider the following pair of questions:

1. Among the cases where the victim was white, was there a higher percentage of whites or blacks sentenced to death?
2. Among the cases where the victim was black, was there a higher percentage of whites or blacks sentenced to death?

Answers:

1. Among the cases with white victims, 19 of 151 or 12.58% of whites were sentenced to death, and 11 of 63 or 17.46% of blacks were sentenced to death.
2. Among the cases with black victims, 0 of 9 or 0.00% of whites were sentenced to death, and 6 of 103 or 5.83% of blacks were sentenced to death.
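
The percentages in these answers can be checked directly from the counts. The Python sketch below is only an illustration (the dictionary layout and the function name `death_rate` are assumptions of this example); it uses the "sentenced of total" counts quoted in the answers above.

```python
# Counts quoted in the text: (sentenced to death, total defendants).
cases = {
    ("White defendant", "White victim"): (19, 151),
    ("White defendant", "Black victim"): (0, 9),
    ("Black defendant", "White victim"): (11, 63),
    ("Black defendant", "Black victim"): (6, 103),
}

def death_rate(defendant, victim=None):
    """Percent sentenced to death, optionally restricted to one victim race."""
    selected = [
        pair for (d, v), pair in cases.items()
        if d == defendant and (victim is None or v == victim)
    ]
    sentenced = sum(s for s, _ in selected)
    total = sum(t for _, t in selected)
    return 100 * sentenced / total

for victim in (None, "White victim", "Black victim"):
    label = victim or "All victims"
    white = death_rate("White defendant", victim)
    black = death_rate("Black defendant", victim)
    print(f"{label}: white {white:.2f}%, black {black:.2f}%")
```

Printing the combined rates next to the rates within each victim group makes the reversal visible: the comparison between defendants flips once the cases are split by race of victim.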

What happened here?!? Although a higher percentage of whites were sentenced to death overall, a higher percentage of blacks were sentenced to death both when the victim was white and when the victim was black! Can you explain this paradox?

Bottom Line: Additional variables, such as the race of the victim here, can play an important role in the analysis of data, and can change our perceptions and conclusions. Variables of this type are known as lurking variables, as first considered in Topic 3.

Topic 7: Displaying and Describing Distributions (Quantitative Data)

When analyzing a distribution, there are many features one might examine. Five of these are outlined below.

1. Center: The center of the distribution is generally the most important and informative aspect of the distribution. Some measures of center with which we are familiar are the mean, median, and mode.

2. Spread or Variability: Giving the center of the distribution is not sufficient. It is also important to give some idea of how spread out the data are. Consider the following two sets of data: (98, 99, 100, 101, 102) and (50, 75, 100, 125, 150). Clearly, these two lists have the same center (100), but the latter has much more variability than the former.

3. Shape: There are 3 common shapes one finds in examining distributions, though not all distributions fall into one of these categories. These shapes are: (a) Symmetric: (b) Skewed to the left:

Chapter 6 Notes

Introduction: Chapter 6 is one of the hardest chapters for 216 students to understand, and to make it a bit worse, it comes up very early in the course, while students are still trying to feel comfortable with the new vocabulary and concepts of statistics.

When we look for a relationship between 2 quantitative variables, the task is relatively easy to visualize and the concept is rather straightforward to grasp. We put one variable on the y axis (the response variable) and the explanatory variable on the x axis, and, using (x, y) points, we plot them on a scatter plot. Then we visually and statistically find a line or curve of best fit which describes the points best. We can even quantify the relationship using statistical values and concepts like the regression coefficient and r-squared values. We can describe various shades of relationship between the two quantities, going from weak to very strong.

When it comes to finding a relationship between 2 categorical variables (the stuff of chapter 6), it becomes a bit harder and somewhat mysterious, as well as more vague. That seems reasonable, since we are trying to relate 2 things which have categories, and to do so in some quantitative way. For this task we use a process that involves making or using a 2-way-table, then computing conditional and sometimes marginal distributions, followed by a segmented bar graph. From these things we can make a determination of relationship, but the determination is simply that the 2 variables are relatively independent of each other, or that they are not independent of each other. And the way we do this is rather strange to the eyes of new statistics students.

2-Way-Tables: Two-way tables have one variable's categories acting as the row titles and the other variable's categories acting as the column titles, with the counts of the observational units in the study that fit in each category of each variable shown as the cell values.
The margins of the table contain the total counts of the rows and columns, respectively. The total-total value is shown in the bottom right cell, and represents the total number of observational units used in the study from which the table came. Statistical convention dictates that the variable whose categories make up the row titles is the y or response variable and the variable whose categories make up the column titles is the x or explanatory variable. When making up a 2-way-table, be sure that your total-total sums the same whether you are adding the marginal column or adding the marginal row!!

Marginal Distributions: The marginal distribution is a listing of proportions, or fractions, whose denominator (bottom number) is ALWAYS the total-total number and whose numerator (top number) is one of the values in the total row or column. To know whether you are working with the row or column totals you need to answer the question: Do I want the marginal distribution of the x variable or the y variable? If the answer is the x variable, use the totals on the total row for marginal numerators; if the answer is the y variable, use the totals on the total column (on the right) for marginal numerators.

Conditional Distributions: Like marginal distributions, conditionals are distributions of fractions, but their denominators are NEVER the total-total value. Instead they are the total row or total column values. The numerators of these conditionals are individual cell counts. To determine what values to use for denominators, you have to understand what a conditional means. I think about it this way. The whole world consists of the number of individuals listed in the total-total cell. If we are finding the conditional of the x variable (assuming the x categories are the column labels), then we reduce our world down to just those who live in the first category (i.e., the left column) of the world. My conditional distribution (meaning my world is now reduced, upon condition, to just those in the first category of the x variable) is found by computing the proportions in that left column, with the column total as denominator and the cell value in each category of the y variable as numerator. Then I find the next conditional, which uses the total of all those who live in the next category of the x variable (i.e., those who live in the next column to the right), and so on, until I have completed the conditional of every column in the x variable categories.
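
The marginal and column-conditional calculations described above can be sketched in Python as follows. This is a minimal illustration, not the method of the course text: Python and all variable names are assumptions of this example, and the counts are borrowed from the overall death-penalty example earlier in this document (the "No" counts are totals minus "Yes" counts).

```python
# Rows are the response (y) categories, columns the explanatory (x) categories.
table = {
    "Yes": {"White": 19, "Black": 17},
    "No":  {"White": 141, "Black": 149},
}

row_totals = {y: sum(cells.values()) for y, cells in table.items()}
column_names = list(next(iter(table.values())))
column_totals = {x: sum(table[y][x] for y in table) for x in column_names}
grand_total = sum(row_totals.values())

# The total-total must be the same whichever margin is added up.
assert grand_total == sum(column_totals.values())

# Marginal distributions: the denominator is always the total-total.
marginal_y = {y: t / grand_total for y, t in row_totals.items()}
marginal_x = {x: t / grand_total for x, t in column_totals.items()}

# Conditional distribution of y on x: the denominator is the column total.
conditional_y_on_x = {
    x: {y: table[y][x] / column_totals[x] for y in table}
    for x in column_names
}

print(marginal_x)
print(marginal_y)
print(conditional_y_on_x)
```

Each inner dictionary of `conditional_y_on_x` corresponds to one bar of a segmented bar graph, so its values always add up to 1.
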
To do conditionals of the y variable (where my y has category names as row labels), my conditional distribution has proportions whose denominators are the various y variable category totals, and whose numerators are the cell values of that y category. We again start with the conditional of the first row, then the next row, and so on until finished with the distribution. Realize that this process is reversed (rows become columns and vice versa) if the table is not done in the conventional way, so that the x variable has categories for row titles and the y variable has categories as column titles.

To understand what I was attempting to explain above takes some practice, and is the reason I say this stuff is a bit hard to get the hang of. Once you get the hang of it, however, you probably won't forget how to do it properly.

Segmented Bar Graph: The segmented bar graph is, in essence, a visualization of the conditional distribution of y on x. In statistics jargon, we always say make a graph or distribution of "... on ...", "... vs ...", or "... to ...", and the first set of dots will ALWAYS be the y variable, while the x variable will ALWAYS come after the "on", "vs", or "to". This may help you if that phrase is the only one you are given in the problem and you need to determine which variable is the response (y) and which is the explanatory (x).

Segmented bar graphs are composed of bars (the categories of the x variable) where each bar has segments (the categories of the y variable). All bars are the same height (1.0 or 100%), and the segmented bars are placed on a pair of x-y axes. The x axis has a label (which is the x variable) and the y axis has the label proportion or percent. Each bar has the x category label under it, and the y variable is identified by a legend of little boxes showing the colors or textures used for each category of the y variable (i.e., the colors/textures of the bar segments). Finally, each graph must have a main title.

Conclusion of the study: After all of this work you are now able to determine a conclusion. If your graph has bars that look approximately the same, then the two variables (x and y) are relatively independent. If one or more bars look different from the others, then the 2 variables are not independent. You might want to say that if the variables are not independent, why don't we just say that they are dependent? In statistics, for this case, there are various levels and types of dependency, which you cannot know from just a 2-way-table, so even though "not independent" is clumsy English, it is more precise statistics-speak.

Some might say that it should be the other way around, that the different looking bars indicate independence or freedom. But here is how I look at this concept, and why what we do makes sense to me. Let's say that I am a statistician working for you and you hired me to construct the study, along with the 2-way-table, graph, etc. I do so and show it to you. You then ask: What proportion falls in the first category of the y variable? If all of my bars on the segmented bar graph are about the same, then I can answer that question immediately, with the proportion of the desired category of y on any of the bars. I am independent to give you an answer.
However, if I have bars with different segment lengths on them, then I cannot answer your question until I ask you: What category of the x variable are you referring to? My answer will change depending on the category. I am not independent to answer your question until YOU ANSWER MINE!

Simpson's Paradox: A paradox is a seeming contradiction. There was a famous guard for the Utah Jazz who was inducted into the NBA hall of fame a while back. There was another player on a competing team who had a better career shooting percentage than John (the guard) had, but who was never even considered for the hall of fame. John, who was a famous 3 point shooter, had a much better 3 point shooting percentage, and a bit better 2 point shooting percentage, than the other player. The paradox is: how did the other player have an overall better shooting percentage than John did? This situation comes up a lot in 2-way-table statistics and is referred to as Simpson's Paradox. There is a marvelous diagram of this situation in the last WATCH OUT section of Ch 6 in your book.

In essence, there is a confounding 3rd variable, which gives different weighting to the proportion you are computing, and this causes the paradox. In our NBA example, John took mostly 3 point shots and had a much better shooting percentage on them than the other player did. The other player took mostly 2 point shots, and although he made a slightly lower percentage of those than John did, when you just do total makes divided by total attempted shots, the other player is higher, because John missed more of the difficult shots in his overall attempts. The key here is to determine what that hidden 3rd variable is, the one which weights the proportion differently. In John's case, he took a higher percentage of more difficult shots, which dragged down his overall unweighted proportion.

Here is another Simpson's paradox, which shows the problem of weighting. Say you and I take 5 courses one semester and I get 3 A's, 1 B and 1 C. You get 1 A, 2 B's and 2 C's. And you beat me in GPA. How is that possible when I got far more A's than you? Answer: the weighting of the grades. My A's were in 1 credit lab courses and the C was in a 5 credit course, whereas your A was in a 5 credit course and your C's were in 1 credit labs.

Summary: So, as you study this chapter, do so slowly, methodically, and thoughtfully, and do many problems and examples. Don't take these concepts too lightly or study too superficially. However, don't overthink this stuff, either. Once you get it, then you will have it.
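
As a worked check of the GPA example above, here is a small Python sketch. Several details are assumptions made for illustration and are not in the original notes: the 4-point grade scale, the 3-credit weight for each B course (the text does not give it), and the function name `gpa`.

```python
GRADE_POINTS = {"A": 4, "B": 3, "C": 2}  # ASSUMPTION: standard 4-point scale

def gpa(courses):
    """Credit-weighted GPA for a list of (grade, credits) pairs."""
    total_points = sum(GRADE_POINTS[grade] * credits for grade, credits in courses)
    total_credits = sum(credits for _, credits in courses)
    return total_points / total_credits

# Credits for the A and C courses follow the text.
# ASSUMPTION: every B course is worth 3 credits.
mine = [("A", 1), ("A", 1), ("A", 1), ("B", 3), ("C", 5)]
yours = [("A", 5), ("B", 3), ("B", 3), ("C", 1), ("C", 1)]

print(f"mine: {gpa(mine):.2f}, yours: {gpa(yours):.2f}")
```

Under these assumptions the student with fewer A's ends up with the higher GPA, because the credits play the role of the hidden weighting variable.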