Class Notes: Week 3

Ronald Heck

This week we will look a bit more into relationships between two variables using crosstabulation tables. Let's go back to the analysis of home language background and third-grade reading proficiency in a subset of data from Week 2. You may try completing some of the analyses.

English * proficient Crosstabulation (Count)

                 proficient = 0   proficient = 1   Total
English = 0            35               5             40
English = 1            93              67            160
Total                 128              72            200

The SPSS table also reports the expected count for each cell.

We can examine a number of relationships in a contingency table that we will use further in developing predictive models with categorical outcomes. The proportion of students who are proficient is the number proficient divided by the sample total (72/200 = 0.36).

First, we can calculate the odds of an event occurring. The odds are defined as follows:

Odds = π / (1 - π)

where the Greek letter π (pi) denotes the probability of the event of interest occurring in the population. In this case, the odds of being proficient are:

Odds = 0.36 / (1 - 0.36) = 0.36 / 0.64 = 0.5625

We can use the odds to obtain the probability of being proficient as follows:

π = odds / (1 + odds) = 0.5625 / 1.5625 = 0.36

This returns us to the proportion that is proficient in the sample (72/200 = 0.36).

We can also obtain the odds ratio and risk estimates in the Statistics dialog box. The odds ratio is defined as the ratio of two odds. More specifically, we can compare the odds of being proficient, broken down by whether students are English speaking (coded 1) [67/93 = 0.720] or non-English speaking (coded 0) [5/35 = 0.143]. This can be understood as the odds ratio of being proficient for English-speaking versus non-English-speaking students:

Odds ratio = (67/93) / (5/35) = 5.04

This matches the coefficient for the odds ratio in the following table.

Risk Estimate

                                  Value
Odds Ratio for English (0 / 1)    5.043
For cohort Prof = 0               1.505
For cohort Prof = 1               0.299
N of Valid Cases                  200

The table also reports the lower and upper bounds of the 95% confidence interval for each estimate.

We can also obtain the relative risk (or risk ratio) for each of the groups. The relative risk compares the probability of the event occurring (rather than the odds of it occurring) between the two groups. The probability of being non-proficient is 35/40 = 0.875 for the first group (non-English speaking) and 93/160 = 0.581 for the second group (English speaking). The relative risk for the non-proficient outcome (0) is then 0.875/0.581 = 1.505. We can also calculate the probability of being proficient (1) for the first group (non-English speaking) as 5/40 = 0.125, and for the second group (English speaking) as 67/160 = 0.419. The relative risk is then 0.125/0.419 = 0.299. Dividing the relative risk of being non-proficient by the relative risk of being proficient recovers the odds ratio: 1.505/0.299 is approximately 5.04, with slight discrepancy due to rounding.

Specifying Models in Regression and GENLIN

Next, we will begin specifying models using Regression and GENLIN in SPSS. We will look more specifically at some of the assumptions underlying various categorical models and at how to find and use these programs in SPSS to examine categorical outcomes.
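The crosstab quantities for the English-by-proficiency table (overall odds, odds ratio, and the two relative risks) can be checked by hand or with a few lines of code. The following is a minimal sketch in Python, not SPSS output; the cell counts are the ones quoted in the notes, and the helper names `odds` and `risk` are illustrative only.

```python
# Cell counts quoted in the notes for the English * proficient crosstab.
# Keys 0/1 are the English codes (0 = non-English speaking, 1 = English speaking).
counts = {
    0: {"not_prof": 35, "prof": 5},
    1: {"not_prof": 93, "prof": 67},
}

def odds(event, non_event):
    """Odds of an event: number with the event divided by number without."""
    return event / non_event

def risk(event, non_event):
    """Probability (risk) of an event within a group."""
    return event / (event + non_event)

# Overall proportion proficient, its odds, and the conversion back to a probability.
total_prof = sum(g["prof"] for g in counts.values())
total_n = sum(g["prof"] + g["not_prof"] for g in counts.values())
p_prof = total_prof / total_n                 # 72 / 200
overall_odds = p_prof / (1 - p_prof)          # pi / (1 - pi)
p_back = overall_odds / (1 + overall_odds)    # odds / (1 + odds)

# Odds of proficiency within each language group, and their ratio.
odds_english = odds(counts[1]["prof"], counts[1]["not_prof"])
odds_non_english = odds(counts[0]["prof"], counts[0]["not_prof"])
odds_ratio = odds_english / odds_non_english

# Relative risks: non-English group compared with English group.
rr_not_prof = risk(counts[0]["not_prof"], counts[0]["prof"]) / risk(
    counts[1]["not_prof"], counts[1]["prof"]
)
rr_prof = risk(counts[0]["prof"], counts[0]["not_prof"]) / risk(
    counts[1]["prof"], counts[1]["not_prof"]
)

print(f"proportion proficient = {p_prof:.3f}, odds = {overall_odds:.4f}")
print(f"odds ratio (English vs. non-English) = {odds_ratio:.3f}")
print(f"relative risk, non-proficient = {rr_not_prof:.3f}")
print(f"relative risk, proficient = {rr_prof:.3f}")
print(f"ratio of relative risks = {rr_not_prof / rr_prof:.3f}")
```

Working from unrounded values, the ratio of the two relative risks equals the odds ratio exactly, which is why the hand calculation with rounded values only agrees approximately.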

We start with the notion that statistical modeling depends on a family of probability distributions for outcome variables (Agresti, 2007). The term random variable describes the possible values that an outcome may have. A generalized linear model has the following components (McCullagh & Nelder, 1989):

1) A random component: a probability distribution, or mathematical function, that links a particular observed outcome obtained in a sample to the probability of its occurrence in a specific population, E(Y) = μ;
2) A link function, g(·), which transforms the expected value of the outcome so that a linear model can be used to examine the relationship between the predictors and the transformed outcome (η); and
3) A structural model, which defines the combination of covariates (continuous) and factors (categorical) that predict values of the transformed outcome.

Let's suppose we wish to investigate a relationship between gender and the probability of being proficient. There are 20 students in the class. When we arrange the data in a cross-tabulation table, we find the following:

proficient * female Cross-tabulation (Count)

                  female = 0   female = 1   Total
proficient = 0         5            1          6
proficient = 1         4           10         14
Total                  9           11         20

We can obtain the chi-square statistic and other supporting tests. They suggest there is a relationship between gender and the probability of being proficient.

Chi-Square Tests (N of Valid Cases = 20)
The output reports the Pearson Chi-Square (a), Continuity Correction, Likelihood Ratio, Fisher's Exact Test, and Linear-by-Linear Association, with Value, df, Asymp. Sig. (2-sided), Exact Sig. (2-sided), and Exact Sig. (1-sided) as applicable.
a. 2 cells (50.0%) have expected count less than 5. The minimum expected count is 2.70.

Let's obtain the odds ratio of being proficient for females versus males from the table. For females, the odds of being proficient are 10/1 (10). For males, the odds of being proficient are 4/5 (0.80). The odds ratio is then 10/0.80, or 12.5 (i.e., 12.5/1).
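The expected counts behind the "cells with expected count less than 5" footnote, the Pearson chi-square, and the odds ratio can all be worked out from the four cell counts. The sketch below is a hand computation in Python rather than SPSS output; it assumes the cell counts implied by the odds quoted in the notes (10/1 for females, 4/5 for males, N = 20).

```python
# Rows: proficient = 0, 1; columns: female = 0 (male), 1 (female).
# Counts are those implied by the odds quoted in the notes (assumption).
table = [
    [5, 1],
    [4, 10],
]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Expected count under independence: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (observed - expected)^2 / expected.
pearson = sum(
    (table[i][j] - expected[i][j]) ** 2 / expected[i][j]
    for i in range(2)
    for j in range(2)
)

# How many cells fall below the usual "expected count of 5" guideline.
small_cells = sum(e < 5 for row in expected for e in row)
min_expected = min(e for row in expected for e in row)

# Odds of proficiency by gender and the female-versus-male odds ratio.
odds_female = table[1][1] / table[0][1]
odds_male = table[1][0] / table[0][0]
odds_ratio = odds_female / odds_male

print("expected counts:", expected)
print(f"cells with expected count < 5: {small_cells}, minimum = {min_expected:.2f}")
print(f"Pearson chi-square = {pearson:.3f}")
print(f"odds ratio (female vs. male) = {odds_ratio:.1f}")
```

With half of the cells below an expected count of 5, the exact test reported alongside the Pearson statistic is usually the safer one to read for a table this small.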

We can use ANALYZE: REGRESSION (Binary Logistic) to obtain results in a different format. Note that if we specify female as categorical, we will probably want to specify the first category as the reference group (males = 0). We can formulate the model as follows (note there is no separate error term, as in a typical regression model):

η = log[π/(1-π)] = β0 + β1(female)

Variables in the Equation

                        B        Exp(B)
Step 1 (a)  female(1)   2.526    12.500
            Constant   -0.223     0.800
a. Variable(s) entered on step 1: female.

The table also reports S.E., Wald, df, and Sig. for each coefficient.

For males, we can take the natural log of the odds: ln(4/5) = -0.223. This is the log odds (or logit) of being proficient if one is a male, which is the same as the intercept (or constant) log odds in the table, that is, the log odds when the other variables in the model are 0. [Note: Generally, ln or log is used to refer to the natural log, whose base e is approximately 2.718.]

For females, we can take the natural log of the odds: ln(10/1) = 2.303. This can also be obtained from the equation in the table above: 2.303 = -0.223 + 2.526(female). We would say the predicted logit favoring proficiency increases by 2.526 units (from -0.223 to 2.303) if one is female as opposed to male. Note: log odds coefficients are added.

If we want the odds that females are proficient versus the odds males are proficient, we can take the natural log of the ratio of the odds of proficiency for females compared with males: ln(10/.8) = ln(12.5) = 2.526. The odds ratio for females versus males can then be estimated as exp(b) = exp(2.526) = 12.5. This can be interpreted as the odds (not the probability) of being proficient for females versus males being increased by a factor of 12.5. We can obtain the odds of females being proficient by multiplying the intercept odds (odds of proficiency for males) by the odds ratio for females versus males (.800*12.5 = 10). This matches the odds females are proficient (10/1 = 10). Note: Odds ratios are multiplied.

The probability that a female is proficient can be estimated from the predicted log odds (2.303) in the equation above.
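The "log odds are added, odds ratios are multiplied" arithmetic can be verified directly. The following sketch assumes only the odds stated in the notes (4/5 for males, 10/1 for females); because the model has one dichotomous predictor, the coefficients can be written in closed form from those odds, and the variable names are illustrative.

```python
import math

# Odds of proficiency stated in the notes.
odds_male = 4 / 5      # reference group (female = 0)
odds_female = 10 / 1   # female = 1

# With one dichotomous predictor, the logit coefficients follow directly:
b0 = math.log(odds_male)                # intercept: log odds for males
b1 = math.log(odds_female / odds_male)  # slope: log of the odds ratio

# Log odds coefficients are added ...
logit_female = b0 + b1 * 1

# ... while odds ratios are multiplied.
odds_female_from_model = math.exp(b0) * math.exp(b1)

# Converting the female odds to a probability: odds / (1 + odds).
p_female = odds_female_from_model / (1 + odds_female_from_model)

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"predicted logit for females = {logit_female:.3f}")
print(f"exp(b1) = {math.exp(b1):.3f}")
print(f"odds for females = {odds_female_from_model:.3f}, probability = {p_female:.3f}")
```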
If we exponentiate the predicted log odds, we obtain the odds that a female is proficient: exp(2.303) = 10. We can then use the formula odds/(1 + odds) to obtain the probability that females are proficient (10/11 = .909).

Other statistics include the -2 log likelihood (or deviance), which is the value of the log likelihood function multiplied by -2 (so it will generally be positive). The Cox and Snell R Square and Nagelkerke R Square are types of pseudo-R-squares. These are not typical R-squares, since they are not based on variance accounted for in Y.

Model Summary
For Step 1, the output reports the -2 Log likelihood (a), Cox & Snell R Square, and Nagelkerke R Square.
a. Estimation terminated at iteration number 5 because parameter estimates changed by less than .001.

We can also use GENLIN to obtain the results:

1) Open ANALYZE: GENERALIZED LINEAR MODELS (Generalized Linear Models). We will obtain a screen with several different types of response variables (default = linear). We can specify binary logistic.
2) Select the Response variable (proficient) and select the reference category (first).
3) Open Predictors and select female (note we can place dichotomous variables either as factors or covariates). If we select female as categorical, we will want to open Options and use the first category (male = 0) as the reference group.
4) Open Model, select female, and place it in the Model box (note you could also build interactions here if there were several predictors).
5) Look at Estimation. The default is model based, which is appropriate for small data sets. If you open the likelihood function and select kernel, you can see how the likelihood ratio test is used from model to model. Regarding estimates, where there is more data available we generally prefer robust estimates.
6) In the Statistics tab, check Include exponential parameter estimates. This tab also provides a Likelihood ratio test (versus the Wald chi-square test). We often prefer the Likelihood ratio test for small samples (but you can leave the default on).
7) Run the model.
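To see roughly what the software is doing when the model is run, the sketch below fits the same one-predictor logistic model by Newton-Raphson iteration and then computes the -2 log likelihood and the two pseudo-R-squares from their standard formulas. It is an illustration under stated assumptions, not a reproduction of the SPSS routines: the case-level data are rebuilt from the odds quoted in the notes, the function names are hypothetical, and the starting values and stopping rule are choices made here, so the iteration count need not match the SPSS footnote.

```python
import math

# Hypothetical case-level data rebuilt from the odds in the notes:
# males (female=0): 4 proficient, 5 not; females (female=1): 10 proficient, 1 not.
data = [(0, 1)] * 4 + [(0, 0)] * 5 + [(1, 1)] * 10 + [(1, 0)] * 1

def predicted_prob(b0, b1, x):
    """Inverse logit: probability implied by the linear predictor b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def log_likelihood(b0, b1, rows):
    """Bernoulli log likelihood summed over cases."""
    total = 0.0
    for x, y in rows:
        p = predicted_prob(b0, b1, x)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def fit_logistic(rows, tol=1e-10, max_iter=50):
    """Newton-Raphson for a logistic model with an intercept and one predictor."""
    b0 = b1 = 0.0
    for iteration in range(1, max_iter + 1):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in rows:
            p = predicted_prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1, iteration

b0, b1, n_iter = fit_logistic(data)
n = len(data)

# Log likelihoods for the fitted model and for the intercept-only model.
ll_model = log_likelihood(b0, b1, data)
p_bar = sum(y for _, y in data) / n
ll_null = log_likelihood(math.log(p_bar / (1.0 - p_bar)), 0.0, data)

minus2ll = -2.0 * ll_model
cox_snell = 1.0 - math.exp((2.0 / n) * (ll_null - ll_model))
nagelkerke = cox_snell / (1.0 - math.exp((2.0 / n) * ll_null))

print(f"b0 = {b0:.3f}, b1 = {b1:.3f} after {n_iter} iterations")
print(f"-2 log likelihood = {minus2ll:.3f}")
print(f"Cox & Snell R Square = {cox_snell:.3f}, Nagelkerke R Square = {nagelkerke:.3f}")
```

The Nagelkerke value rescales the Cox and Snell value by its maximum attainable value, which is why it is always the larger of the two.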
We first receive some information to check whether the correct probability distribution and link function have been used.

Model Information

Dependent Variable          proficient (a)
Probability Distribution    Binomial
Link Function               Logit
a. The procedure models 1 as the response, treating 0 as the reference category.

We also receive a variety of model fit information, some of which is more useful when comparing a series of models. In ML estimation, the model is fit by evaluating the likelihood of candidate population parameter values given the observed data. The likelihood function describes the discrepancy between the two (its value lies between 0 and 1), and we typically take the log of it. So, for example, taking the natural log of the value of the likelihood function gives approximately the log likelihood reported in the table below.

Model fit information (Model: (Intercept), female)
The table reports Value, df, and Value/df for the Deviance, Scaled Deviance, Pearson Chi-Square, and Scaled Pearson Chi-Square, followed by the Log Likelihood (b), Akaike's Information Criterion (AIC), Finite Sample Corrected AIC (AICC), Bayesian Information Criterion (BIC), and Consistent AIC (CAIC).
a. Information criteria are in small-is-better form.
b. The kernel of the log likelihood function is displayed and used in computing information criteria.

Finally, the estimates are presented. The scale parameter is fixed at 1 because the binomial distribution has no separate variance parameter (the variance is tied to the expected value, or mean). You can see some subtle differences in the output presented through the REGRESSION and GENLIN routines in SPSS.

Parameter Estimates (Model: (Intercept), female)
Columns: B, Std. Error, Hypothesis Test (Wald Chi-Square, df, Sig.), and Exp(B).
Rows: (Intercept); [female=1]; [female=0], with B = 0 (a); (Scale), with value 1 (b).
a. Set to zero because this parameter is redundant.
b. Fixed at the displayed value.

Below, the likelihood ratio test suggests that the model with gender is better than a model with just the intercept. This compares the change in the model from the intercept only (i.e., no predictors are included in the model) to a model with one predictor.

Omnibus Test (a) (Model: (Intercept), female)
Columns: Likelihood Ratio Chi-Square, df, and Sig.
a. Compares the fitted model against the intercept-only model.
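The omnibus likelihood ratio test can be reproduced by hand from the cell counts. The sketch below is a hand computation, not the GENLIN output itself: it assumes the cell counts implied by the odds in the notes, uses the fact that with a single dichotomous predictor the fitted probabilities equal the observed group proportions, and obtains the p-value from the identity that a 1-df chi-square tail probability equals `erfc(sqrt(x/2))`. The function name `group_log_likelihood` is illustrative.

```python
import math

def group_log_likelihood(n_event, n_non_event):
    """Bernoulli log likelihood for a group, evaluated at its observed proportion."""
    p = n_event / (n_event + n_non_event)
    return n_event * math.log(p) + n_non_event * math.log(1.0 - p)

# Cell counts implied by the odds in the notes (assumption):
# males 4 proficient / 5 not, females 10 proficient / 1 not.
ll_intercept_only = group_log_likelihood(4 + 10, 5 + 1)
ll_with_gender = group_log_likelihood(4, 5) + group_log_likelihood(10, 1)

# Likelihood ratio chi-square: -2 times the difference in log likelihoods.
lr_chi_square = -2.0 * (ll_intercept_only - ll_with_gender)

# Upper-tail probability of a chi-square variate with 1 df.
p_value = math.erfc(math.sqrt(lr_chi_square / 2.0))

critical_value = 3.84  # chi-square critical value for 1 df at p < .05
print(f"log likelihood, intercept only = {ll_intercept_only:.3f}")
print(f"log likelihood, with gender    = {ll_with_gender:.3f}")
print(f"likelihood ratio chi-square    = {lr_chi_square:.3f}")
print(f"exceeds 3.84: {lr_chi_square > critical_value}, p = {p_value:.4f}")
```

Because GENLIN displays the kernel of the log likelihood, the individual log likelihood values in its fit tables may not match the ones printed here, but the difference between the two models, and therefore the test statistic, should agree.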

Notice that if we go back and estimate an intercept-only model, we obtain a corresponding table of fit information.

Model fit information (Model: (Intercept))
The table reports the same quantities as for the model with gender: Deviance, Scaled Deviance, Pearson Chi-Square, Scaled Pearson Chi-Square, Log Likelihood (b), AIC, AICC, BIC, and CAIC.
a. Information criteria are in small-is-better form.
b. The kernel of the log likelihood function is displayed and used in computing information criteria.

Taking the log likelihood of the intercept-only model minus the log likelihood of the model with gender, and multiplying that difference by -2, gives 5.366, which is the chi-square estimate for the Likelihood Ratio Test (with slight discrepancy due to rounding). Because this value is larger than the critical value of 3.84 (for 1 df, p < .05), we can conclude that the model with gender fits the data better than a model with just the intercept (as we might expect). As this comparison suggests, the GLM approach offers several advantages over the simpler cross-tabulation approach for examining relationships between variables.

References

Agresti, A. (2007). An introduction to categorical data analysis. Hoboken, NJ: John Wiley & Sons.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). New York: Chapman & Hall.