DEPARTMENT OF HEALTH AND HUMAN SCIENCES HS900 RESEARCH METHODS


 Laurel Hancock
Using SPSS Session 2

Topics addressed today:
1. Recoding data: missing values, collapsing categories
2. Making a simple scale
3. Standardisation
4. Crosstabulations
5. Correlations
6. Comparing means
7. Selecting cases

Accessing data from CMR

Download the data3.sav file.

Recoding and creating variables

Missing values

In the data we are using today, missing values are coded to negative values (usually -9 to -1). There are a number of reasons why some data may be missing: nonresponse, proxy interviews, and questions that are not applicable to that respondent (e.g. asking the marital status of someone under 16 years of age). Unless we are interested in analysing the missing data for nonresponse patterns, these values are usually recoded to be system missing. This tells SPSS that there is no data for that variable for that particular respondent. Not all datasets are as systematically coded as this. Check your data very carefully to determine which values should be treated as missing.

Before starting your analysis, change the output settings (Edit > Options > Output Labels, and change all options to either "Names and Labels" or "Values and Labels").

To see the recode work, run a Frequencies procedure on the variable ghqa (ghq: concentration). Note that there is one case for "-9 missing or wild" and 5 cases for "-7 proxy respondent". Also look at the actual data in the Data View window and find a case with negative values.

[SPSS output: frequency table for GHQA (ghq: concentration). Rows: -9 missing or wild; -7 proxy respondent; 1 better thn usual; 2 same as usual; 3 less than usual; 4 much less thn usual; Total. Columns: Frequency, Percent, Valid Percent, Cumulative Percent.]
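The idea behind recoding negative codes to system missing can be illustrated outside SPSS. The following is a minimal Python sketch, not part of the SPSS workflow: the data values and the function names `recode_missing` and `frequencies` are hypothetical, and `None` stands in for SPSS's system-missing value.

```python
# Hypothetical raw codes for one GHQ item; negative codes mark missing data.
raw_ghqa = [2, 3, -9, 1, -7, 2, 4, -7]


def recode_missing(values, low=-9, high=-1):
    """Replace every code in the inclusive range low..high with None."""
    return [None if low <= v <= high else v for v in values]


def frequencies(values):
    """Count each valid code; missing values are tallied under 'missing'."""
    counts = {}
    for v in values:
        key = "missing" if v is None else v
        counts[key] = counts.get(key, 0) + 1
    return counts


ghqa = recode_missing(raw_ghqa)
print(ghqa)
print(frequencies(ghqa))
```

After the recode, the negative codes no longer appear as categories of their own; they are counted once, as missing.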
To code all values between -9 and -1 to system missing, choose the pull down menu Transform > Recode > Into Same Variables. (Note: you must be in Data View or Variable View, as the Transform pull down menu is not available in the Output window.)

- Put all variables into the Variables box (similar to the Frequencies procedure)
- Click on Old and New Values

In the new window:
- In the Old Value details on the left of the window, choose the Range toggle and enter the values -9 and -1
- In the New Value details on the right of the window, choose the System-missing toggle
- Then click on Add, and "-9 thru -1 --> SYSMIS" will be seen in the Old --> New box
- Click on Continue

In the original window:
- Click on OK

Rerun the Frequencies procedure on the ghqa variable and note that the summary and the frequency table have changed: the 6 cases with missing data are no longer counted as valid. Also, look again at your case(s) that previously had negative values; there should now be a "." in the cell.

[SPSS output: Frequencies for GHQA (ghq: concentration). Statistics table: N Valid, N Missing. Frequency table rows: Valid (1 better thn usual; 2 same as usual; 3 less than usual; 4 much less thn usual; Total), Missing (System), Total. Columns: Frequency, Percent, Valid Percent, Cumulative Percent.]

Collapsing a variable into groups

For some analyses you may wish to divide the sample into groups according to respondents' scores on some variable (e.g. to give low, medium, and high scoring groups). This requires a number of steps. To illustrate this process, we will break the continuous variable age into three groups: youngest to 35, 36 to 55, and 56 to oldest. How can you find the youngest and oldest? The procedure is similar to that for missing values, but this time create new variables so the original ones are not lost.
- Choose the pull down menu Transform > Recode > Into Different Variables
- Put the variable age into the Variables box. The text "age --> ?" appears in the box. This is the reminder to now enter the name of the new variable. In this case the name of the new variable is agecat. Type this in the Output Variable Name box and label the variable "age categories".
- Click on Change
- Click on Old and New Values

In the new window:
- In the Old Value details on the left of the window, choose the Range toggle and enter the values 16 and 35
- In the New Value details on the right of the window, enter 1
- Then click on Add, and "16 thru 35 --> 1" will be seen in the Old --> New box
- Repeat the procedure for the other two age groups
- Click on Continue

In the original window:
- Click on OK

Produce a frequency table for the variable agecat.

Making a scale

When you have several variables that can be combined into a measurement that assesses a more general concept, one technique for doing so is to create a scale. In our data there are 12 items (questions) from the General Health Questionnaire (GHQ). Remember that last week I included the scale variable in the data. This week, you are going to construct the scale yourself.

Look at your earlier output for the variable ghqa and you can see that the items are coded with four possible responses from 1 to 4, with higher scores indicating poorer mental health. The GHQ scale score is an additive scale derived by simply adding the scores from all 12 items together, which creates a scale with a minimum value of 12 and a maximum value of 48. Why?

To make the scale, use the pull down menu Transform > Compute.
- In the Target Variable box type: ghqscale
- In the Numeric Expression box type: ghqa + ghqb + ghqc + ... + ghql (include all 12 items)

Run a Descriptives procedure to check the new variable. What is the minimum value, maximum value and mean? Why are there only 249 cases with a scale score? You can label your new variable in the Variable View window.
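The arithmetic of the additive scale can be checked with a short Python sketch. This is an illustration under assumptions, not the SPSS procedure itself: the respondents are invented, `None` stands in for system missing, and the function `ghq_scale` is hypothetical. It mimics the behaviour of an expression built with `+`, where a missing item leaves the whole sum missing.

```python
# Item names ghqa ... ghql, matching the 12 GHQ items in the handout.
ITEMS = ["ghq" + letter for letter in "abcdefghijkl"]


def ghq_scale(respondent):
    """Add the 12 item scores; return None if any item is missing."""
    scores = [respondent.get(item) for item in ITEMS]
    if any(score is None for score in scores):
        return None
    return sum(scores)


# Hypothetical respondents used only to show the range of the scale.
all_lowest = {item: 1 for item in ITEMS}    # every item answered 1
all_highest = {item: 4 for item in ITEMS}   # every item answered 4
one_missing = dict(all_lowest, ghqc=None)   # one item not answered

print(ghq_scale(all_lowest), ghq_scale(all_highest), ghq_scale(one_missing))
```

Comparing the number of valid scale scores with the number of valid cases on each item is a useful check after running Compute.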
Collapsing a variable into (almost) equal groups

Before we can divide up the GHQ scale scores into groups, we first need to inspect the distribution of scores and determine the cut-off points that will be used to divide the sample into 4 (almost) equally sized groups. We will then create a new variable (ghqcat) which will have four different values (1, 2, 3, and 4).
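As a rough illustration of this two-stage idea (find three cut points, then assign each score to one of four groups), here is a Python sketch using only the standard library. The scores are invented, the function name `assign_group` is hypothetical, and the percentile method used by `statistics.quantiles` may not match SPSS's output exactly, so treat the cut points as indicative only.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

# Hypothetical scale scores, for illustration only.
scores = [14, 16, 17, 18, 19, 19, 20, 20, 21, 21, 21, 22, 23, 25, 26, 30]

# Three cut points split the distribution into four groups.
cuts = quantiles(scores, n=4)


def assign_group(score, cut_points):
    """Return 1..4: scores at or below a cut point fall in the lower group."""
    return bisect_left(cut_points, score) + 1


groups = [assign_group(s, cuts) for s in scores]
group_sizes = Counter(groups)
print(cuts)
print(sorted(group_sizes.items()))
```

In SPSS the cut points are read from the Frequencies output and then rounded to whole scores before recoding, so the group boundaries there are integers.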
To determine the cut-off points, use the pull down menu for the Frequencies procedure: Analyze > Descriptive Statistics > Frequencies.
- Select the ghqscale variable and transfer it into the Variables box.
- Click on the Statistics button at the bottom of the screen.
- Click on Cut points for equal groups.
- Put the number of groups you want in the box (in this case, four).
- Click on Continue.
- Back in the original Frequencies window, you can untoggle Display frequency tables, which will stop the long frequency table being displayed, or leave it on if you wish!

In the output from this procedure look for the heading labelled Percentiles. As you requested four equal groups, there will be THREE values: one each for the 25th, 50th and 75th percentiles. The values listed here will be used in the next step, which involves creating a new variable with four values. The group with a new value of 1 will include any case with a score from 14 (check the Descriptives output) to 19, the next (2) will contain scores from 20 to 21, the next (3) will contain scores from 22 to 25, and the highest group (4) will have scores of 26 to 40.

Create the new variable (ghqcat) using the same procedure as you used to collapse age into groups above. (Don't forget to remove previous commands in the boxes.)

Do a frequency distribution of your new variable. (Don't forget to retoggle the display command.) Why do the four categories have different numbers of cases?

Standardisation of variables

Sometimes it is useful to transform continuous variables into standardised variables. Standardising a variable transforms the raw scores into Z scores. A Z score expresses a raw score as the number of standard deviations it lies above or below the mean of the original variable. Therefore, the new variable has a mean of zero and a standard deviation of one, in a similar way to the standard normal curve.

Can you find the way to do this using the Descriptives procedure? Standardise the ghqscale variable. How does SPSS tell you this is a standardised variable?
Compare the results from the Descriptives procedure for the original and new variables. What is the relationship between the standard deviation of the original variable and the maximum value of the new variable?
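The standardisation itself is a one-line formula: subtract the mean and divide by the standard deviation. The Python sketch below applies it to invented scores; the function name `z_scores` is hypothetical, and the sample standard deviation (dividing by n - 1) is used.

```python
from statistics import mean, stdev

# Hypothetical scale scores, for illustration only.
scores = [14, 18, 20, 21, 22, 25, 34]


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    centre = mean(values)
    spread = stdev(values)  # sample standard deviation (n - 1 in the divisor)
    return [(v - centre) / spread for v in values]


z = z_scores(scores)
print([round(value, 2) for value in z])
print(round(mean(z), 6), round(stdev(z), 6))
```

The last line prints the mean and standard deviation of the standardised values, which can be compared with the Descriptives output for the new variable in SPSS.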
Measures of Association

Measures of association is a generic term for numerous statistics. There are too many to cover in this course, so five have been chosen that cover common usage. The table below summarises the statistics and when they are to be used.

              dichotomous        nominal            ordinal              interval
dichotomous   phi (φ)
nominal       chi-squared (χ²)   chi-squared (χ²)
ordinal       chi-squared (χ²)   chi-squared (χ²)   Spearman's rho (ρ)
interval      eta (η)*           eta (η)*           Spearman's rho (ρ)   Pearson's r

* providing the interval variable is dependent. Also see Comparing means.

There are two main ways of obtaining these statistics: crosstabulation and correlation.

Crosstabulation (Crosstabs)

Crosstabulations are the simplest way to determine a statistical association (measure of association) between two variables and work best with categorical variables. A crosstabulation is simply the number of cases in each category of one variable divided into the categories of another variable.

The Crosstabs procedure produces a table that compares data groups from two different categorical variables. Using this table, we can ask SPSS to calculate statistics that give a measure of their association. (Remember your lesson on causality.)

Before beginning the Crosstabs procedure, take a few moments to think about your variables: specifically, the number of categories of each and the type of variable. These factors will affect the look of the table and the resulting statistics.
- Have at least 5 cases in each cell of the table.
- Keep the number of categories to a minimum. The table becomes unwieldy if it has too many categories.
- For a tall table, the variable with more categories should be in the rows section.
- For a wide table, the variable with more categories should be in the columns section.
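To make the idea of a crosstabulation concrete, here is a small Python sketch that counts cases in each combination of two categorical variables and then converts the counts to row percentages (percentages that add to 100% within each row). The records and the function names `crosstab` and `row_percentages` are invented for illustration; this is not how SPSS is operated.

```python
from collections import Counter

# Hypothetical (sex, registered disabled) pairs, one per respondent.
records = [
    ("male", "yes"), ("male", "no"), ("male", "no"), ("male", "no"),
    ("female", "yes"), ("female", "no"), ("female", "no"),
    ("female", "no"), ("female", "no"),
]


def crosstab(pairs):
    """Count the cases in each (row category, column category) cell."""
    return Counter(pairs)


def row_percentages(counts):
    """Convert cell counts to percentages of each row total."""
    row_totals = Counter()
    for (row, _column), n in counts.items():
        row_totals[row] += n
    return {
        (row, column): 100.0 * n / row_totals[row]
        for (row, column), n in counts.items()
    }


counts = crosstab(records)
percents = row_percentages(counts)
for cell in sorted(percents):
    print(cell, counts[cell], round(percents[cell], 1))
```

With the row variable as the factor, the percentages in one column are then compared down the rows.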
To create a crosstab, use the pull down menu Analyze > Descriptive Statistics > Crosstabs.
- In the Crosstabs window, select the variable(s) to include in your analysis
- Move the selected variables to the Row or Column box as appropriate

In this example the variables sex and disabled are used to create a crosstab. It is convention to put the dependent variable (outcome), if there clearly is one, in the columns.

While the actual counts in the cells tell us something of the distributions, most people find percentages more intuitive. If the outcome is in the columns, it is usual to add row percentages (i.e. the percentages add to 100% in each row), and then the percentages are compared across the rows of the categories of the independent variable (factor) of choice. Percent across, compare down.

[SPSS output: Crosstabs. Case Processing Summary for SEX sex * DSBL registered disabled, with N and Percent for Valid, Missing and Total cases; Missing = 2.8%.]

SEX sex * DSBL registered disabled Crosstabulation (% within SEX sex; counts not shown)

              1 yes     2 no      Total
1 male        6.4%      93.6%     100.0%
2 female      2.1%      97.9%     100.0%
Total         ...       96.1%     100.0%

In this example, 6.4% of men are registered disabled while 2.1% of women are registered disabled.

Can you find how SPSS will calculate the statistics for a crosstab? Produce crosstabs and appropriate statistics for:
o Sex / disabled
o Marital status / smoker
o Education / smoker
o Social class / disabled

Are there any problems with these tables?

Correlation

Correlations are used to describe the strength and direction of the linear relationship between two variables, normally ordinal or interval. There are a number of different statistics available from SPSS, depending on the level of measurement, but we will use Spearman's rho when one of the variables is ordinal and Pearson's r when both are interval. (Pearson's r is robust to use for ordinal variables with more than 5 categories and a reasonably large N in the sample.)

Spearman's rho is used as a measure of association between two ordinal level variables or between one ordinal and one interval level variable.

Pearson's r (also known as the product-moment coefficient) is designed for interval level (continuous) variables. Pearson's r can only take on values from -1 to +1. The sign indicates whether there is a positive correlation (as one variable increases, so too does the other) or a negative correlation (as one variable increases, the other decreases). The size of the absolute value (ignoring the sign) provides an indication of the strength of the relationship.
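For readers who want to see what the two coefficients compute, the following Python sketch implements Pearson's r from its definition and obtains Spearman's rho by applying the same formula to ranks (tied values share the average of their ranks). The function names and example data are hypothetical, and the sketch returns only the coefficient, not the significance test that SPSS also reports.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 upward; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(round(pearson_r(x, y), 3), round(spearman_rho(x, y), 3))
```

Because rho works on ranks, it reaches its maximum whenever one variable rises consistently with the other, even if the relationship is curved.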
A perfect correlation of -1 or +1 means that the value of one variable can be determined exactly by knowing the value of the other variable. On the other hand, a correlation of zero indicates no relationship between the two variables.

Both the Spearman's rho and Pearson's r procedures are accessed via the pull down menu Analyze > Correlate > Bivariate.
- Put the variables you are interested in into the Variables box
- Toggle the Correlation Coefficients options you want

Produce appropriate correlation coefficients for:
o Education / social class
o Education / GHQ scale
o Age / GHQ scale
o Social class / GHQ scale

Some guidelines on the strength of the association:
r = ±0.10 to ±0.29   weak
r = ±0.30 to ±0.49   moderate
r = ±0.50 to ±1.0    strong

Partial correlations

SPSS will calculate two types of correlation for you. First, it will give you a simple bivariate correlation (just two variables, as above), also known as a zero-order correlation. SPSS will also allow you to explore the relationship between two (interval level) variables while controlling for another variable. This is known as partial correlation.

Try to produce the partial correlation of education and social class controlling for age. Is the partial correlation smaller or larger than the bivariate correlation?

Eta and comparing means

For the sake of completeness, the statistic eta was included in the table of measures of association. The statistic can be generated by SPSS under the Crosstabs procedure. However, when the interest is in differences in a continuous variable by categories of a nominal variable, a difference of means test is usually used. Eta does have one advantage in that it does not assume linearity in the continuous variable.

To compare means between two groups we use a t-test. For three or more groups, ANOVA (ANalysis Of VAriance) is appropriate but is not covered in this lesson.

T-tests

There are a number of different types of t-tests available in SPSS. The one we will discuss is the independent samples t-test, used when you want to compare the mean scores of two different groups of people or conditions.
Independent samples t-test

For example, a research question may be: Is there a significant difference in the mean GHQ score between men and women? Let's see:

Analyze > Compare Means > Independent-Samples T Test
- Move the dependent (continuous) variable (ghqscale) into the Test Variable box
- Move the independent (categorical) variable (sex) into the Grouping Variable box
- Click on Define Groups and type in the numbers used in the data set to code each group. In the data file men = 1 and women = 2; therefore in the Group 1 box type 1, and in the Group 2 box type 2
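SPSS reports this test in two versions, labelled "Equal variances assumed" and "Equal variances not assumed". As a rough guide to what lies behind those two lines, the Python sketch below computes the t statistic and degrees of freedom for each version from textbook formulas (the pooled-variance test and Welch's test). The scores and function names are hypothetical, and the sketch does not compute P values or Levene's test.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """t and df when the two groups are assumed to have equal variances."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, n_a + n_b - 2


def welch_t(a, b):
    """t and approximate df when equal variances are not assumed."""
    n_a, n_b = len(a), len(b)
    v_a, v_b = variance(a) / n_a, variance(b) / n_b
    t = (mean(a) - mean(b)) / sqrt(v_a + v_b)
    df = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical scale scores for two groups, for illustration only.
group_1 = [20, 22, 19, 24, 25]
group_2 = [18, 21, 17, 20, 19]
print(pooled_t(group_1, group_2))
print(welch_t(group_1, group_2))
```

With equal group sizes the two t values coincide and only the degrees of freedom differ; with unequal group sizes and unequal variances both can change.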
[SPSS output: T-Test. Group Statistics table for GHQSCALE by SEX sex (1 male, 2 female) with columns N, Mean, Std. Deviation, Std. Error Mean. Independent Samples Test table for GHQSCALE with rows "Equal variances assumed" and "Equal variances not assumed" and columns Levene's Test for Equality of Variances (F, Sig.) and t-test for Equality of Means (t, df, Sig. (2-tailed), Mean Difference, Std. Error Difference, 95% Confidence Interval of the Difference: Lower, Upper).]

Interpretation of the output

In the Group Statistics box, SPSS gives you the mean and standard deviation for each of your groups. It also gives you the number of people in each group. Always check these values first. Do they seem right?

The first section of the Independent Samples Test output box gives you the results of Levene's test for equality of variances. This tests whether the variance of the scores for the two groups is the same. The outcome of this test determines which of the t values that SPSS provides is the correct one to use.
- If the significance level (Sig.) of Levene's test is larger than .05 (e.g. .07, .10), you should use the t-test in the first line of the table, which refers to Equal variances assumed.
- If it is .05 or less (e.g. .01, .001), this means that the variances for the two groups are not the same, so your data violate the assumption of equal variance. SPSS provides you with an alternative t value which compensates for the fact that your variances are not the same. You should use the information in the second line of the t-test table, which refers to Equal variances not assumed.

If the value in the Sig. (2-tailed) column is equal to or less than .05, then there is a significant difference in the mean scores on your dependent variable for the two groups. If the value is above .05, there is no significant difference between the two groups, as in this case.
See if there is a difference in the mean value on the ghqscale variable between those who are married and those who are divorced. (Hint: adjust the Define Groups boxes.)

Paired samples t-test

There is another common t-test: the paired samples t-test, used when you want to compare the mean scores of the same group of people on two different occasions, or when you have matched pairs. If you wish to see how this works, download data4.sav and have a go at looking at the two GHQ scores. You can leave this to the end of the session if you wish!

Paired t-tests (also referred to as repeated measures) are used when you have only one group of people and you collect data from them on two different occasions, or under two different conditions. Pretest and posttest experimental designs are an example of the type of situation where this technique is appropriate. You assess each person on some continuous measure at Time 1 and then at Time 2, after exposing them to some experimental manipulation or intervention.

This approach is also used when you have matched pairs of subjects (that is, each person is matched with another on specific criteria such as age, sex etc.). One of the pair is exposed to Intervention 1 and the other is exposed to Intervention 2. Scores on a continuous measure are then compared for each pair.

Paired samples t-tests can also be used when you measure the same person in terms of her response to two different questions. In this case, both dimensions should be rated on the same scale.

A word on null hypotheses

Hypotheses are in the form of either a substantive hypothesis, which, as has been pointed out, represents the predicted association between variables, or a null hypothesis, which is a statistical artifice and always predicts the absence of a relationship between the variables. Hypothesis testing is based on the logic that the substantive hypothesis is tested by assuming that the null hypothesis is true.
Testing the null hypothesis involves calculating how likely (the probability) the results were to have occurred if there really were no differences. Thus the onus of proof rests with the substantive hypothesis that there is a change or difference. The null hypothesis is compared with the research observations and statistical tests are used to estimate the probability of the observations occurring by chance (Bowling, 2002 p.169).

Which brings us to P values...

All the statistics we have calculated (phi, chi-squared etc.) are tested to determine if they are statistically significant. This is usually done by comparing their value to a point on an appropriate distribution determined by the statistic and the degrees of freedom. For example, the t distribution is a family of curves (in the same way as the normal curve is) and the shape of the curve is determined by the degrees of freedom. The value of the statistic is plotted (by SPSS!) against the relevant curve to determine the P value for that statistic. The most commonly used threshold is a P value below 0.05 (or 5%). This means that, if the null hypothesis were true, there would be less than a 5% chance of obtaining a result at least this extreme (a false positive).

So, in the case of the independent t-test example above, we test the null hypothesis that there is no difference between the mean GHQ scores for men and women. From the output we see that the t statistic is ... with a P value of ... As we are looking for evidence to reject the null hypothesis, we are looking for a P value of 0.05 or less. In this case the P value is well above 0.05, and so we cannot reject the null hypothesis of no difference.
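The phrase "the probability of the observations occurring by chance" can be made concrete with a different technique from the t-test: an exact permutation test. This is not what SPSS does in the procedures above, and the data and the function name `exact_permutation_p` are invented. The sketch reassigns the pooled scores to two groups in every possible way and reports the share of reassignments whose difference in means is at least as large as the one observed, which is a P value under the null hypothesis of no difference.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(a, b):
    """Two-sided exact permutation P value for a difference in means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    positions = range(len(pooled))
    extreme = 0
    total = 0
    # Every way of choosing which pooled scores form the first group.
    for chosen in combinations(positions, len(a)):
        chosen_set = set(chosen)
        first = [pooled[i] for i in chosen]
        second = [pooled[i] for i in positions if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(first) - mean(second)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Hypothetical tiny samples; real data sets need far more cases.
p_value = exact_permutation_p([1, 2, 3], [4, 5, 6])
print(p_value)
```

The number of reassignments grows very quickly with sample size, so this exhaustive version is only practical for very small groups.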
Selecting cases

If you are interested in restricting your analysis to a certain group in the sample, you can select those cases before running your statistical procedures. For example, if you wanted the crosstabulation of marital status by disabled for women only, you can select women using the sex variable before running the Crosstabs procedure.

To select women:
- Make sure you have the Data window open
- From the pull down menu choose Data > Select Cases (Note: the Data pull down menu is not available in the Output window)
- Toggle If condition is satisfied
- Click If

In the new window:
- Type (or click) sex=2 in the top box
- Note the toggle for whether unselected cases are filtered or deleted. Make sure Filtered is chosen, as filtered cases can be brought back later. Deleted cases cannot be recovered.
- Click Continue

In the original window:
- Run Crosstabs for marital status and disabled

To unselect:
- Make sure you have the Data window open
- From the pull down menu choose Data > Select Cases (Note: the Data pull down menu is not available in the Output window)
- Toggle All cases

Splitting the file

Sometimes when doing research you may want to compare the strength of the relationships for two separate groups. For example, you might have reason to believe that the correlation between education and GHQ might be different for men compared to women. We can do this by splitting the file. The example that follows will compare the correlation coefficients for two groups of subjects: men and women. This command will split the sample by sex and repeat any analyses that follow for these two groups separately.

To split the sample:
- Make sure you have the Data window open
- From the pull down menu choose Data > Split File (Note: the Data pull down menu is not available in the Output window)
- Toggle on Compare Groups
- Move the grouping variable (sex) into the box labelled Groups based on
- Follow the steps given earlier to request the correlation between education and GHQ

Important: If you look now at the bottom right hand corner of your screen, you will see Split File On.
Until you take the Split File option off, all analyses will be produced separately for the two groups. To turn it off: Data > Split File > Analyze all cases, do not create groups.
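The difference between selecting cases and splitting the file can be sketched in Python. Selecting keeps only the cases that meet a condition, while splitting keeps every case but repeats the analysis within each group. The respondents, the variable name `educ`, and the function names `select_cases` and `split_file` are hypothetical; only the coding of sex (1 = male, 2 = female) comes from the handout.

```python
from statistics import mean

# Hypothetical respondents; sex is coded 1 = male, 2 = female.
respondents = [
    {"sex": 1, "educ": 1, "ghqscale": 18},
    {"sex": 2, "educ": 3, "ghqscale": 24},
    {"sex": 2, "educ": 2, "ghqscale": 21},
    {"sex": 1, "educ": 4, "ghqscale": 27},
    {"sex": 2, "educ": 1, "ghqscale": 19},
]


def select_cases(rows, condition):
    """Keep the rows meeting the condition; the original list is untouched."""
    return [row for row in rows if condition(row)]


def split_file(rows, key):
    """Group rows by the value of one variable so analyses can be repeated."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    return groups


# Select cases: women only, like the condition sex=2.
women = select_cases(respondents, lambda row: row["sex"] == 2)
print(len(women))

# Split file: the same summary is produced separately for each sex.
by_sex = split_file(respondents, "sex")
for code, rows in sorted(by_sex.items()):
    print(code, len(rows), mean(row["ghqscale"] for row in rows))
```

Because `select_cases` returns a new list, the full sample remains available afterwards, which parallels choosing Filtered rather than Deleted in SPSS.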