Recitation, Week 3: Basic Descriptive Statistics and Measures of Central Tendency:


 Vincent Brooks
1. What does Healey mean by data reduction?
   a. Data reduction involves using a few numbers to summarize the distribution of a variable, or an array of data, as he calls it.
2. What is the problem with using only a few numbers to summarize the distribution of a variable?
   a. Summarizing a distribution involves using statistics such as the mean, denoted $\bar{X}$, or the standard deviation, denoted $\sigma$, to describe the variable. This inevitably leads to a loss of information (precision and detail).
3. When analyzing descriptive statistics, it is best to describe the data in terms of percentages rather than frequency counts. Comparisons are difficult to conceptualize as raw frequencies.
   a. EXAMPLE: Instead of saying 20 out of 100 students got 4. on the exam, say 20% of students got 4. on the exam.
4. What is the difference between percentage and proportion?
   a. A percentage is a proportion multiplied by 100.
5. What is a measure of central tendency?
   a. It is a way to summarize the distribution to give you an idea of the typical case of that distribution, in other words, its center.
   b. There are three measures of central tendency:
      i. The mean: describes the typical score
      ii. The mode: describes the most frequently occurring score
         1. It is the only measure of central tendency that can be used with nominal variables
      iii. The median: the 50th percentile of the distribution
         1. The median is a special case of a percentile, which is the score below which a specific percentage of cases fall.
   c. How does the median differ from the mode and the mean? Unlike the mode or the mean, the median always represents the exact center of a distribution
      of scores, meaning that 50% of the cases always fall above the median and 50% of the cases always fall below the median.
   d. Characteristics of the mean
      i. The mean is the balance point of any distribution: it is the point around which all of the scores cancel out. Mathematically, if I subtract the mean from each value and sum the results, the resulting sum will be equal to 0, that is, $\sum_i (X_i - \bar{X}) = 0$.
      ii. The mean can be misleading because it is sensitive to every observation, including extreme ones. The median is less sensitive to extreme observations, so it is often the better statistic to report.
         1. To illustrate this, consider the familiar normal or bell curve. This is a symmetric distribution because there are as many values to the left of the center as there are to the right. Many natural phenomena, such as height and weight, have approximately normal distributions.
         2. There are important distributions that are not symmetric. When a distribution is not symmetric, it is skewed. There are two types of skewed distributions, right skewed and left skewed.
         3. EXAMPLE of RIGHT SKEWED: Income. Often it is better to report the median than the mean, since the mean is misleading in extreme cases.
            a. EXAMPLE. Consider the following summary of AGE. Notice that the arithmetic mean is somewhat greater than the median. The reason is that the distribution is right skewed. If the mean is larger than the median, the distribution is typically right skewed.
            [Table: Statistics for AGE OF RESPONDENT; rows: N (Valid, Missing), Mean, Median]
            To see this, create a histogram of the age variable.
            [Figure: histogram of AGE OF RESPONDENT; Mean = 44.9]
6. What is a measure of dispersion?
   a. Measures of central tendency don't tell us anything about how much the data values differ from each other.
      i. EXAMPLE: What is the mean of the following two distributions of AGE?
      ii. The distributions are obviously very different.
   b. Measures of dispersion or variability attempt to quantify the spread of observations.
   c. Dispersion is usually defined in terms of variability around the mean.
   d. A deviation is the distance between an individual score and the mean; mathematically, this is $X_i - \bar{X}$.
   e. The farther a score is from the mean, the larger its deviation will be.
   f. The more closely the scores are clustered around the mean, the less variability there will be.
      i. PRACTICAL EXAMPLE: Let's assume that the average income for people with PhDs is $55,000 and the average income for people with a high school education is $20,000. Since people with only a HS education have fewer opportunities than people with PhDs, most of them would make somewhere around $20K, so there is not much variation. However, it is possible for PhDs to make anywhere from $20K to $800K per year, and hence there is much more variation around the average salary for PhDs than there is for HS graduates.
7. USING SPSS to Produce Measures of Dispersion
   a. Use Descriptives to find the range and standard deviation for age, educ and tvhours
   b. To reproduce this output, first open the gss98randsamp.sav file
   c. Go to the familiar Analyze > Descriptive Statistics > Frequencies
   d. Put the variables corresponding to age, educ and tvhours in the box labeled Variable(s)
   e. Click the Statistics button and check mean, minimum, maximum and standard deviation
   f. Click Continue, then OK
   g. Notice that the chart in the book looks a little bit different, so let's transpose the rows and columns to make it look like Healey's.
   h. Double-click on the output window
   i. Then from the menu select Pivot > Transpose Rows and Columns. You should get what is shown below.
      [Table: Statistics for AGE OF RESPONDENT, HIGHEST YEAR OF SCHOOL COMPLETED, and HOURS PER DAY WATCHING TV; columns: N (Valid, Missing), Mean, Std. Deviation, Minimum, Maximum]
   j. How do we interpret these results? What is 1 standard deviation above the mean for the variable tvhours?
8. USING THE COMPUTE COMMAND to Create an Attitude Toward Abortion Scale
   a. There are two distinct measures of attitudes toward abortion in the 1998 GSS survey. One variable, abany, asks respondents whether they believe that abortion should be allowed for any reason. The other, abhlth, measures whether they feel abortion should be allowed only to preserve the health of the woman.
   b. We want to create a summary measure that gives us an overall measure of attitude toward abortion.
   c. We must know something about the data. If the response on abany is 1, then the person was in favor of abortion for any reason. If the value in the dataset is 2, then the person was opposed. Similarly for abhlth: 1 = in favor of abortion if health is at stake, 2 = not in favor.
   d. We want an overall measure of antiabortion position, so we will sum the variables. Because 2 codes opposition on both items, a higher sum means stronger opposition. If the person was in favor of both, then our new variable will have a value of 2 (1+1). In favor of one and not the other gives a value of 3 (2+1 or 1+2). If a person is completely against abortion, the value is going to be 4 (2+2).
   e. We need to use the Compute command. To open the Compute Variable dialog box, from the menus choose: Transform > Compute
   f. In the Compute Variable dialog box, type abscale in the Target Variable box. This is the name of the variable we are creating.
   g. Click the Type & Label button and type Abortion Scale as the label, then click Continue
   h. Select (or type) abany in the variable list and move it into the Numeric Expression box. Then type +, then abhlth, and click OK.
   i. Get the frequency distribution of each variable.
      [Table: Statistics for ABORTION IF WOMAN WANTS FOR ANY REASON, WOMANS HEALTH SERIOUSLY ENDANGERED, and Abortion Scale; rows: N (Valid, Missing), Mean, Std. Deviation, Minimum, Maximum]
      [Table: frequency distribution of ABORTION IF WOMAN WANTS FOR ANY REASON; Valid: YES, NO; Missing: NAP, DK, NA; columns: Frequency, Percent, Valid Percent, Cumulative Percent]
      [Table: frequency distribution of WOMANS HEALTH SERIOUSLY ENDANGERED; Valid: YES, NO; Missing: NAP, DK, NA; columns: Frequency, Percent, Valid Percent, Cumulative Percent]
      [Table: frequency distribution of Abortion Scale; Valid categories; Missing: System; columns: Frequency, Percent, Valid Percent, Cumulative Percent]
      Note: for category 3.00, we know these are the cases where the person approved in one situation but not in the other, but we do not know which situation they approved. It seems reasonable that they approved when the health of the woman was at stake but not for any reason, but we would have to use other procedures to find that out.

SPSS companion exercises 2.5, but choose 1 variable from world.sav, recode it, get frequency distributions for the variable, and summarize the results.
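
The claims in sections 5(d), 6 and 7(j) can be checked by hand outside SPSS. The following is a minimal Python sketch, not part of the recitation itself. The income values are made up for illustration (they are not GSS data), and the standard deviation shown is the sample version with n - 1 in the denominator, which is an assumption about which formula you want.

```python
import statistics

# Hypothetical right-skewed incomes in thousands of dollars (illustrative only).
incomes = [18, 20, 22, 25, 30, 35, 40, 400]

mean = statistics.mean(incomes)
median = statistics.median(incomes)

# A deviation is each score minus the mean: X_i - mean.
deviations = [x - mean for x in incomes]

# Sample standard deviation (n - 1 denominator); an assumption, see lead-in.
std_dev = statistics.stdev(incomes)

# One standard deviation above the mean, as asked in 7(j).
one_sd_above = mean + std_dev

print("mean:", mean)
print("median:", median)
print("sum of deviations:", sum(deviations))
print("standard deviation:", round(std_dev, 2))
print("one SD above the mean:", round(one_sd_above, 2))
```

The single extreme income pulls the mean well above the median, which is the right-skew pattern described in 5(d), while the deviations still sum to zero.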
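
The logic of the Compute step in section 8 can also be sketched in Python. This is an illustration under stated assumptions, not SPSS syntax: the respondent records are hypothetical, the helper name compute_abscale is made up, and missing responses (NAP, DK, NA) are represented as None. The sketch leaves the scale missing whenever either item is missing, which is consistent with the "Missing: System" row in the Abortion Scale output.

```python
from collections import Counter

# Hypothetical coded responses: 1 = in favor, 2 = opposed, None = missing.
respondents = [
    {"abany": 1, "abhlth": 1},
    {"abany": 2, "abhlth": 1},
    {"abany": 2, "abhlth": 2},
    {"abany": None, "abhlth": 1},
]


def compute_abscale(abany, abhlth):
    """Sum the two items; leave the scale missing if either item is missing."""
    if abany is None or abhlth is None:
        return None
    return abany + abhlth


scale = [compute_abscale(r["abany"], r["abhlth"]) for r in respondents]

# Frequency distribution over valid (non-missing) scale values only.
freq = Counter(value for value in scale if value is not None)
n_valid = sum(freq.values())
n_missing = len(scale) - n_valid

for value in sorted(freq):
    valid_percent = 100 * freq[value] / n_valid
    print(value, freq[value], round(valid_percent, 1))
print("missing:", n_missing)
```

A value of 3 only tells us that the respondent favored one item and opposed the other; as the note above says, the sum alone cannot tell us which one.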
