Multiple Regression
Kean University
February 12, 2013


Contents
1. Multiple Linear Regression
2. Logistic Regression
3. Statistical Definitions
4. Regression Models & SEM

Multiple Linear Regression

Regression techniques are primarily used to create an equation that can be used to predict values of the dependent variable for all members of the population. A secondary function of regression is that it can be used as a means of explaining relationships between variables.

Types of Linear Regression

Standard Multiple Regression: all independent variables are entered into the analysis simultaneously.

Sequential Multiple Regression (Hierarchical Multiple Regression): independent variables are entered into the equation in a particular order, as decided by the researcher.

Stepwise Multiple Regression: typically used as an exploratory analysis, with large sets of predictors.
1. Forward Selection: bivariate correlations between all the IVs and the DV are calculated, and IVs are entered into the equation from the strongest correlate to the weakest.
2. Stepwise Selection: similar to forward selection; however, if an IV no longer appears to contribute much to the equation in combination with the other predictors, it is removed.
3. Backward Deletion: all IVs are entered into the equation. A partial F test is calculated for each variable as if it were entered last, to determine its level of contribution to the overall prediction. The variable with the smallest partial F is removed, based on a predetermined criterion.

Variables

IV: also referred to as predictor variables; one or more continuous variables.
DV: also referred to as the outcome variable; a single continuous variable.

Assumptions that must be met:
1. Normality. All errors should be normally distributed, which can be checked by looking at the skewness, kurtosis, and histogram plots. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients requires only that the errors be identically and independently distributed.
2. Independence. The errors associated with one observation are not correlated with the errors of any other observation.
3. Linearity. The relationship between the IVs and the DV should be linear.
4. Homoscedasticity. The variance of the residuals should be constant across all levels of the IVs, which can be checked by plotting the residuals.
5. Model specification. The model should be properly specified, including all relevant variables and excluding irrelevant variables.

Other important issues:

Influence: individual observations that exert undue influence on the coefficients. Are there covariates that you should be including in your model?

Collinearity: the predictor variables should be related to the outcome, but not so strongly correlated with one another that they measure the same thing (e.g., using both age and grade), which will lead to multicollinearity. Multicollinearity misleadingly inflates the standard errors, making some variables appear statistically insignificant when they should be significant.
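To make the assumption checks concrete, here is a minimal illustrative sketch outside of SPSS, using Python's statsmodels; the data and variable names (entry_age, days_in_placement, num_offenses) are synthetic stand-ins loosely echoing the offenses example used later, not the workshop data.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Synthetic stand-in data; in practice you would load your own dataset.
rng = np.random.default_rng(0)
n = 226
entry_age = rng.uniform(10, 18, n)
days_in_placement = rng.uniform(0, 365, n)
num_offenses = 1.5 + 0.2 * entry_age + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([entry_age, days_in_placement]))
model = sm.OLS(num_offenses, X).fit()
print(model.summary())                    # coefficients, R-squared, F test

resid = model.resid
print("skewness:", stats.skew(resid))     # rough normality check (assumption 1)
print("kurtosis:", stats.kurtosis(resid))
```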

Unusual and Influential Data

A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis. If a single observation (or a small group of observations) substantially changes your results, you would want to know about this and investigate further. There are three ways that an observation can be unusual.

Outliers: in linear regression, an outlier is an observation with a large residual; in other words, an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.

Leverage: an observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. High-leverage points can have an unusually large effect on the estimates of the regression coefficients.

Influence: an observation is said to be influential if removing it substantially changes the estimates of the coefficients. Influence can be thought of as the product of leverage and outlierness.

Collinearity Diagnostics

VIF: formally, variance inflation factors (VIF) measure how much the variances of the estimated coefficients are increased over the case of no correlation among the X variables. If no two X variables are correlated, all of the VIFs will be 1. If two or more variables have a VIF around or greater than 5 (some say up to 10 is acceptable), one of these variables should be removed from the regression model. To determine the best one to remove, remove each one individually and select the regression equation that explains the most variance (the highest R²).

Tolerance: the value should be greater than .10; a value less than .10 indicates a collinearity issue.

Other informal signs of multicollinearity:
- Regression coefficients change drastically when adding or deleting an X variable.
- A regression coefficient is negative when theoretically Y should increase with increasing values of that X variable, or positive when theoretically Y should decrease with increasing values of that X variable.
- None of the individual coefficients has a significant t statistic, but the overall F test for fit is significant.
- A regression coefficient has a nonsignificant t statistic even though, on theoretical grounds, that X variable should provide substantial information about Y.
- High pairwise correlations between the X variables. (But three or more X variables can be multicollinear together without having high pairwise correlations.)
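As an illustration of the VIF and tolerance diagnostics above, a short Python sketch using statsmodels; the deliberately collinear age/grade pair is made up to mirror the age-and-grade example.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 226
age = rng.uniform(10, 18, n)
grade = age - 6 + rng.normal(0, 0.3, n)    # deliberately collinear with age
days = rng.uniform(0, 365, n)

X = sm.add_constant(np.column_stack([age, grade, days]))
for i, name in enumerate(["const", "age", "grade", "days"]):
    if name == "const":
        continue
    vif = variance_inflation_factor(X, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```

Note that tolerance is simply 1/VIF, so the two diagnostics always agree.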

How to deal with multicollinearity:
- Increasing the sample size is a common first step, but it only partially offsets the problem.
- The easiest solution: remove the most intercorrelated variable(s) from the analysis. This method is misguided if the variables were included on theoretical grounds.
- Combine variables into a composite variable by building an index. Remember: in order to create an index, you need theoretical and empirical reasons to justify this action.
- Use centering: transform the offending independent variables by subtracting the mean from each case.
- Drop the intercorrelated variables from the analysis but substitute their crossproduct as an interaction term, or combine the intercorrelated variables in some other way. This is equivalent to respecifying the model by conceptualizing the correlated variables as indicators of a single latent variable. Note: if a correlated variable is a dummy variable, the other dummies in that set should also be included in the combined variable in order to keep the set of dummies conceptually together.
- Leave one intercorrelated variable as is, but then remove the variance in its covariates by regressing them on that variable and using the residuals.
- Assign the common variance to each of the covariates by some (probably arbitrary) procedure.
- Treat the common variance as a separate variable and decontaminate each covariate by regressing it on the others and using the residuals. That is, analyze the common variance as a separate variable.

To Conduct the Analysis in SPSS

1. With your dataset open, click Analyze, then Regression, then Linear.
2. The Linear Regression window will open. Select the outcome variable (DV) and click the right arrow to move it into the Dependent box. Then highlight all of the independent variables (predictors) and click the right arrow to move them into the Independent(s) box.
3. Select the method of regression that is most appropriate for the data set (a sketch of the selection logic follows these steps):
a. Enter: enters all IVs, one at a time, into the model regardless of significant contribution.
b. Stepwise: combines the Forward and Backward methods, using criteria for both entering and removing IVs from the equation.
c. Remove: first uses the Enter method; the specified variable(s) are then removed from the model and the Enter method is repeated.
d. Backward: enters all IVs one at a time and then removes them one at a time, based on a predetermined level of significance for removal (the SPSS default is p = .10).
e. Forward: enters only IVs that significantly contribute to the model.
4. Click the Statistics button to open the Statistics dialogue box. Check the appropriate statistics, which usually include Estimates, Model fit, Descriptives, Part and partial correlations, and Collinearity diagnostics. Note: if running a stepwise regression, also check R squared change.
5. Click the Options button to open the Options dialogue box. Here you can change the inclusion and exclusion criteria (the probability or F value for entry or removal), depending on the method of regression used.
6. Optional: if needed, click the Plots button to add plots and histograms to the output. Clicking the Save button also gives options to save the residuals, etc.
7. To create a syntax file, simply click Paste.
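SPSS performs these entry methods internally; for intuition only, here is an illustrative sketch of forward selection by p-value in Python. The threshold and column names are hypothetical, and real stepwise criteria vary by package.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(df, dv, candidates, alpha_enter=0.05):
    """Enter the predictor with the best p-value at each step."""
    selected, remaining = [], list(candidates)
    while remaining:
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            pvals[var] = sm.OLS(df[dv], X).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(226, 3)), columns=["age", "days", "race"])
df["offenses"] = 0.5 * df["age"] + rng.normal(size=226)
print(forward_select(df, "offenses", ["age", "days", "race"]))  # likely ['age']
```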

Output

Run the syntax, and the output should look similar to the examples below. The Variables Entered box shows which variables have been included in or excluded from the regression analysis, and the method by which they were entered. Depending on the method of regression used, certain variables may be removed for failing to meet predetermined criteria.

The Model Summary box outlines the overall fit of the model. R is the correlation between the variables, which should be the same as shown in the Correlations table. R Square indicates the amount of variance in the dependent variable that is explained by the predictor variables; use the adjusted R Square when you have more than one predictor (IV). In this case, the predictor variables account for 9.8% of the variance in number of offenses. The adjusted R Square is a more conservative estimate of variance explained that removes variability likely due to chance; however, this value is not often reported or interpreted.

The ANOVA table is used to test whether the model significantly predicts the outcome variable. In this example, the model does significantly predict the outcome variable, because p < .001. The F ratio and its significance tell the degree to which the model predicts the DV.

The Coefficients box notes the degree and significance of each predictor's effect on the outcome variable. In this example, only whether or not one is incarcerated and entry age are significant predictors. When conducting regression analyses, it may be useful to run multiple combinations of predictor variables and regression methods. The unstandardized B is the amount by which the outcome variable (DV) changes for a one-unit change in each predictor. Standardized betas express each predictor's effect in standard deviation units, allowing the predictors' relative contributions to be compared. The t statistic and its significance tell whether each predictor is significant.
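For readers replicating this outside SPSS, the same quantities can be read off a fitted statsmodels model; a minimal sketch on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(226, 2))
y = x @ np.array([0.4, 0.0]) + rng.normal(size=226)
model = sm.OLS(y, sm.add_constant(x)).fit()

print(f"R2 = {model.rsquared:.3f}, adjusted R2 = {model.rsquared_adj:.3f}")
print(f"F({model.df_model:.0f}, {model.df_resid:.0f}) = {model.fvalue:.2f}, "
      f"p = {model.f_pvalue:.4f}")
print(model.params)                  # unstandardized B
print(model.tvalues, model.pvalues)  # t statistics and significance
```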

Sample Write Up & Table

A multiple regression was also conducted to predict the number of offenses based on the available independent variables. The predictors included incarcerated (vs. not incarcerated), the age at first offense, the number of days in placement, and race. The overall model was significant, F(5, 220) = 4.80, p < .001, and accounted for 9.8% of the variance. The results indicated that incarceration and the age at first offense were significant predictors of the number of offenses committed (see Table 9). The number of days in placement and race were not significant predictors of the number of offenses committed. Incarceration (compared to non-incarceration) was associated with an increase in the number of offenses committed (Beta = .243, p < .01). In addition, controlling for the other predictors, as age at first offense increased, the number of offenses also increased (Beta = .187, p < .01).

Table 9
Multiple Regression Analyses of Incarceration Status, Age, Days in Placement, and Race on Number of Offenses (N = 226)

Predictor               Unstandardized B    SE    Beta    t    p
Incarcerated
Age at First Offense
Days in Placement
African American
Caucasian

Note. F(5, 220) = 4.80, p < .001, R² = .098

Logistic Regression

Logistic regression, which may be binary or multinomial, is a type of prediction analysis that predicts a dichotomous dependent variable based on a set of independent variables.

Variables

DV: one dichotomous dependent variable (e.g., alive/dead, married/single, purchase/not purchase).
IVs: one or more independent variables, which can be either continuous or categorical.

Assumptions that must be met:
1. Sample size. Reducing a continuous variable to a binary or categorical one loses information and attenuates effect sizes, reducing the power of the logistic procedure. Therefore, in many cases, a larger sample size is needed to ensure the power of the statistical procedure. It is recommended that the sample size be at least 30 times the number of parameters, or 10 cases per independent variable.
2. Meaningful coding. Logistic coefficients will be difficult to interpret if the variables are not coded meaningfully. The convention for binary logistic regression is to code the dependent class of greatest interest as 1 ("the event occurring") and the other class as 0 ("the event not occurring").
3. Proper specification of the model. This is particularly crucial; parameters may change in magnitude, and even direction, when variables are added to or removed from the model.
a. Inclusion of all relevant variables: if relevant variables are omitted, the common variance they share with included variables may be wrongly attributed to those variables, or the error term may be inflated.
b. Exclusion of all irrelevant variables: if causally irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to the irrelevant variables. The more the irrelevant variable(s) correlate with the other independents, the greater the standard errors of the regression coefficients for those independents.
4. Linearity. Logistic regression does not require linear relationships between the independent factors or covariates and the dependent, but it does assume a linear relationship between the independents and the log odds (logit) of the dependent.
a. Box-Tidwell transformation (test): add to the logistic model interaction terms that are the crossproduct of each independent times its natural logarithm. If these terms are significant, there is nonlinearity in the logit. This method is not sensitive to small nonlinearities. (A sketch of this check appears at the end of this section.)
5. No outliers. As in linear regression, outliers can affect results significantly.

Types of Logistic Regression

Binary Logistic Regression: treats all IVs as continuous covariates; categorical variables must be declared in SPSS.
Multinomial Logistic Regression: all IVs are explicitly entered as factors, and the reference category of the outcome variable must be set in SPSS.
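A minimal illustrative sketch of the Box-Tidwell check from assumption 4a, using Python's statsmodels on synthetic data; in SPSS you would instead add the crossproduct terms as covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 500)                       # must be strictly positive
y = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * age - 2))))

X = sm.add_constant(pd.DataFrame({"age": age, "age_ln_age": age * np.log(age)}))
result = sm.Logit(y, X).fit(disp=0)

# A significant age * ln(age) term would signal nonlinearity in the logit.
print(result.pvalues["age_ln_age"])
```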

To Conduct the Analysis in SPSS

1. With your data set open, click Analyze, then Regression, then either Binary or Multinomial Logistic.
2. Move the DV into the Dependent box, and move the IVs into the Covariates box.
3. Select the method of regression that is most appropriate for the data set:
a. Enter: enters all IVs, one at a time, into the model regardless of significant contribution.
b. Stepwise: combines the Forward and Backward methods, using criteria for both entering and removing IVs from the equation.
c. Remove: first uses the Enter method; the specified variable(s) are then removed from the model and the Enter method is repeated.
d. Backward: enters all IVs one at a time and then removes them one at a time, based on a predetermined level of significance for removal (the SPSS default is p = .10).
e. Forward: enters only IVs that significantly contribute to the model.
4. Click the Options button and check the box next to CI for exp(B) to display the confidence intervals. Then click Continue.
5. Paste and run the syntax.
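For comparison, an illustrative Python sketch of the same kind of analysis: a binary logistic model with Exp(B) odds ratios and their 95% confidence intervals. The data are synthetic, with names loosely following the job-satisfaction example below.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 800
age = rng.uniform(20, 65, n)
educ = rng.uniform(8, 20, n)
prob = 1 / (1 + np.exp(-(1.5 - 0.02 * age - 0.04 * educ)))
not_satisfied = rng.binomial(1, prob)       # 1 = "not very satisfied"

X = sm.add_constant(pd.DataFrame({"age": age, "educ": educ}))
result = sm.Logit(not_satisfied, X).fit(disp=0)

odds = np.exp(result.params)                # Exp(B)
ci = np.exp(result.conf_int())              # 95% CI for Exp(B)
print(pd.DataFrame({"OR": odds, "lower": ci[0], "upper": ci[1],
                    "p": result.pvalues}))
```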

Output

Run the syntax, and the output should look similar to the examples below.

The first box outlines how many cases were included in and excluded from the analysis; report the n included in the analysis in your write-up. The Dependent Variable Encoding box shows the label for each coding. This is important to note, because SPSS creates the regression equation based on the likelihood of having a value of 1. In this case, SPSS is creating an equation to predict the likelihood that an individual is not very satisfied.

The next set of tables falls under the heading Block 0: Beginning Block and consists of three tables: Classification Table, Variables in the Equation, and Variables not in the Equation. This block provides a description of the null model and does not include the predictor variables. Do not report or interpret these values.

Block 1 is what is interpreted and reported in the write-up. The -2 Log likelihood is not interpreted or reported. The Omnibus Test uses a chi-square to determine whether the model is statistically significant. Cox & Snell R Square and Nagelkerke R Square are measures of effect size; typically, Nagelkerke is reported over Cox & Snell. Based on the regression equation created from the analysis, SPSS predicts which group individual cases will belong to, and then calculates the percentage of correct predictions.
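The omnibus chi-square and the pseudo-R² effect sizes that SPSS prints can also be computed by hand from a fitted model; a sketch using statsmodels on synthetic data, with the standard Cox & Snell and Nagelkerke formulas:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))
result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

llr = result.llr                                  # omnibus chi-square: 2*(LL1 - LL0)
print(f"chi2({result.df_model:.0f}) = {llr:.2f}, p = {result.llr_pvalue:.4f}")

n = result.nobs
cox_snell = 1 - np.exp(-llr / n)
nagelkerke = cox_snell / (1 - np.exp(2 * result.llnull / n))
print(f"Cox & Snell R2 = {cox_snell:.3f}, Nagelkerke R2 = {nagelkerke:.3f}")
```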

In the Variables in the Equation table, the estimated coefficient and standard error are reported but not interpreted. Report the significance of each predictor, and report both the lower and upper limits of the confidence interval. Exp(B) is the odds ratio for each predictor. As mentioned, SPSS is predicting the likelihood of the DV being a 1, in this case Not Very Satisfied. When the odds ratio is less than 1, increasing values of the variable correspond to decreasing odds of the event's occurrence. When the odds ratio is greater than 1, increasing values of the variable correspond to increasing odds of the event's occurrence.

Sample Write Up and Table

A logistic regression analysis was conducted to predict whether an individual was not very satisfied with his or her job (see Table 1). Overall, the model was significant, χ²(4) = 16.71, p = .002, Nagelkerke R² = .025. Of all the predictor variables, only age was a significant predictor, p < .001, with an odds ratio of .980, indicating that as an individual's age increases, he or she is less likely to be not very satisfied with his or her job. Years of education was a marginally significant predictor, p = .085, with an odds ratio of .957, indicating that as years of education increase, the likelihood of being not very satisfied with one's job decreases. None of the remaining predictors (e.g., hours worked per week, number of siblings) were significant predictors of job satisfaction, ns.

Table 1
Summary of Logistic Regression Predicting Job Satisfaction (Satisfied or Not Satisfied)

Predictor              β    Odds Ratio    95% CI Lower    95% CI Upper    p
Age
Years of Education
Hours per Week
Number of Siblings

Note. χ²(4) = 16.71, p = .002, Nagelkerke R² = .025

Statistical Definitions

Binary (dichotomous) variable
A binary variable has only two values, typically 0 or 1. Similarly, a dichotomous variable is a categorical variable with only two values. Examples include success or failure, male or female, and alive or dead.

Categorical variable
A variable that can be placed into separate categories based on some characteristic or attribute. Also referred to as qualitative, discrete, or nominal variables. Examples include gender, drug treatments, race or ethnicity, disease subtypes, and dosage level.

Causal relationship
A causal relationship is one in which a change in one variable can be attributed to a change in another variable. The study needs to be designed in a way that makes it legitimate to infer cause. In most cases, the term causal conclusion indicates findings from an experiment in which the subjects are randomly assigned to a control or experimental group. Causality cannot be determined from a correlational research design. Furthermore, it is important to note that a significant finding (small p-value) does not signify causality. The medical statistician Austin B. Hill outlined nine criteria for establishing causality in epidemiological research: temporal relationship, strength, dose-response relationship, plausibility, consideration of alternate explanations, experiment, specificity, and coherence.

Central Limit Theorem
The Central Limit Theorem is the foundation for many statistical techniques. The theorem states that the larger the sample size (> 30), the more closely the sampling distribution of the mean will approach a normal distribution. The mean of the sampling distribution of the mean will approach the true population mean, and its standard deviation will be σ/√n (the population standard deviation divided by the square root of n). The population from which the sample is drawn does not need to be normally distributed. Furthermore, the Central Limit Theorem explains why the approximation improves with larger samples, and why sampling error is smaller with larger samples than with smaller samples.

Confidence interval
A confidence interval is an interval estimate of a population parameter, consisting of a range of values bounded by upper and lower confidence limits. The parameter is estimated as falling somewhere between the two values. Researchers can assign a degree of confidence to the interval estimate (typically 90%, 95%, or 99%), indicating that the interval will include the population parameter that percentage of the time. The higher the confidence level, the wider the confidence interval.

Confounding variable
A confounding variable is one that obscures the effects of another variable. In other words, a confounding variable is associated with both the independent and dependent (outcome) variables, and therefore affects the results. Confounding variables are also called extraneous variables and are problematic because the researcher cannot be sure whether results are due to the independent variable, the confounding variable, or both. Smoking, for instance, would be a confounding variable in the relationship between drinking alcohol and lung cancer. Therefore, a researcher studying the relationship between alcohol consumption and lung cancer should control for the effects of smoking. A positive confounder is related to the independent and dependent variables in the same direction; a negative confounder displays an opposite relationship to the two variables.
If there is a confounding effect, researchers can use a stratified sample and/or a statistical model that controls for the confounding variable (e.g., multiple regression, analysis of covariance).
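A quick simulation makes the Central Limit Theorem above concrete: even for a markedly non-normal population, sample means center on the population mean with a standard deviation near σ/√n. All numbers here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # decidedly non-normal

n = 50
sample_means = [rng.choice(population, n).mean() for _ in range(5_000)]

print("population mean:", population.mean())
print("mean of sample means:", np.mean(sample_means))   # close to the above
print("SD of sample means:", np.std(sample_means))      # close to sigma/sqrt(n)
print("sigma / sqrt(n):", population.std() / np.sqrt(n))
```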

Continuous variable
A variable that can take on any value within the range over which the variable varies. Continuous variables are measured on ratio or interval scales. Examples include age, temperature, height, and weight.

Control group
In experimental research and many types of clinical trials, the control group is the group of participants that does not receive the treatment. The control group is used for comparison and is treated exactly like the experimental group, except that it does not receive the experimental treatment. In many clinical trials, one group of patients will be given an experimental drug or treatment, while the control group is given either a standard treatment for the illness or a placebo (e.g., a sugar pill).

Covariate
A covariate is a variable that is statistically controlled for using techniques such as multiple regression analysis or analysis of covariance. Covariates are also known as control variables and, in general, have a linear relationship with the dependent variable. Using covariates in analyses allows the researcher to produce more precise estimates of the effect of the independent variable of interest. To determine whether the use of a covariate is legitimate, examine the effect of the covariate on the residual (error) variance: if the covariate reduces the error, it is likely to improve the analysis.

Degrees of freedom
The degrees of freedom, usually abbreviated df, represent the number of values free to vary when calculating a statistic. For instance, the degrees of freedom in a 2x2 crosstab table are calculated by multiplying the number of rows minus 1 by the number of columns minus 1. Therefore, if the totals are fixed, only one of the four cell counts is free to vary, and df = (2-1)(2-1) = 1.

Dependent variable
The dependent variable is the effect of interest that is measured in the study. It is termed the dependent variable because it depends on another variable. Also referred to as the outcome or criterion variable.

Descriptive / Inferential statistics
Descriptive statistics. Descriptive statistics provide a summary of the available data. They are used to simplify large amounts of data by summarizing, organizing, and graphing quantitative information. Typical descriptive statistics include measures of central tendency (mean, median, mode) and measures of variability or spread (range, standard deviation, variance).
Inferential statistics. Inferential statistics allow researchers to draw conclusions or inferences from the data. Typically, inferential statistics are used to make inferences or claims about a population based on a sample drawn from that population. Examples include independent t tests and analysis of variance (ANOVA) techniques.

Effect size
An effect size is a measure of the strength of the relationship between two variables. Sample-based effect sizes are distinguished from the test statistics used in hypothesis testing in that they estimate the strength of an apparent relationship rather than assigning a significance level reflecting whether the relationship could be due to chance. The effect size does not determine the significance level, or vice versa. Some fields apply words such as "small," "medium," and "large" to the size of an effect. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition.
Some common measures of effect size are Cohen's d, Cramér's V, odds ratios, standardized beta weights, Pearson's r, and partial eta squared.

Independent variable
Independent variables are typically controlled or manipulated by the researcher. Independent variables are also used to predict the values of another variable. Furthermore, researchers often use demographic variables (e.g., gender, race, age) as independent variables in statistical analysis. Examples of independent variables include the treatment given to groups, the dosage level of an experimental drug, gender, and race.

Measures of central tendency: mean, median, mode
Measures of central tendency summarize data using the value that is most typical or representative; they include the mean, median, and mode.
Mean. The mean (strictly speaking, the arithmetic mean) is also known as the average. It is calculated by adding up the values for each case and dividing by the total number of cases. It is often symbolized by M or X̄ ("X-bar"). The mean is influenced by outliers and should not be used with skewed distributions.
Median. The median is the central value of a set of values ranked in ascending (or descending) order. Since 50% of all scores fall at or below the 50th percentile, the median is the score located at the 50th percentile. The median is not influenced by extreme scores and is the preferred measure of central tendency for a skewed distribution.
Mode. The mode is the value that occurs most frequently in a set of scores. The mode is not influenced by extreme values.

Measures of dispersion: variance, standard deviation, range
Measures of dispersion are statistics that show the amount of variation or spread in the scores, or values, of a variable. Widely scattered or variable data result in large measures of dispersion, whereas tightly clustered data result in small measures of dispersion. Commonly used measures of dispersion include the variance and the standard deviation.
Variance. A measure of the amount of variability in a set of scores, calculated as the square of the standard deviation. Larger values for the variance indicate that individual cases are further from the mean and the distribution is wider; smaller variances indicate that individual cases are closer to the mean and the distribution is tighter. The population variance is symbolized by σ² and the sample variance by s².
Standard deviation. A measure of spread or dispersion in a set of scores; the standard deviation is the square root of the variance. As with the variance, the more widely the scores are spread out, the larger the standard deviation. Unlike the variance, which is expressed in squared units of measurement, the standard deviation is expressed in the same units as the original data. In the event that the standard deviation is greater than the mean, the mean would be deemed inappropriate as a representative measure of central tendency. The empirical rule states that for normal distributions, approximately 68% of the distribution falls within ±1 standard deviation of the mean, 95% within ±2 standard deviations, and 99.7% within ±3 standard deviations. The standard deviation is symbolized by SD or s.

Multivariate / Bivariate / Univariate
Multivariate. Quantitative methods for examining multiple variables at the same time. For instance, designs that involve two or more independent variables and two or more dependent variables would use multivariate analytic techniques.
Examples include multiple regression analysis, MANOVA, factor analysis, and discriminant analysis.
Bivariate. Quantitative methods that involve two variables.
Univariate. Methods that involve only one variable. Often used to refer to techniques in which there is only one outcome or dependent variable.
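A brief demonstration of the central tendency and dispersion measures defined above (the scores are made up; note how the outlier pulls the mean but not the median):

```python
import numpy as np
from statistics import mode

scores = [2, 3, 3, 4, 5, 6, 21]             # 21 is an outlier

print("mean:", np.mean(scores))              # pulled upward by the outlier
print("median:", np.median(scores))          # robust to the outlier
print("mode:", mode(scores))
print("range:", max(scores) - min(scores))
print("sample variance:", np.var(scores, ddof=1))
print("sample SD:", np.std(scores, ddof=1))  # square root of the variance
```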

Normal distribution
The normal distribution is a bell-shaped, theoretical continuous probability distribution. The horizontal axis represents all possible values of a variable, and the vertical axis represents the probability of those values. The scores on the variable are clustered around the mean in a symmetrical, unimodal fashion. The mean, median, and mode are all the same in the normal distribution. The normal distribution is widely used in statistical inference.

Null / Alternative hypotheses
Null hypothesis. In general, the null hypothesis (H0) is a statement of no effect. The null hypothesis is set up under the assumption that it is true, and is therefore tested for rejection.
Alternative hypothesis. The hypothesis alternative to the one being tested (i.e., the alternative to the null hypothesis). The alternative hypothesis is denoted by Ha or H1 and is also known as the research or experimental hypothesis. Rejecting the null hypothesis (on the basis of some statistical test) indicates that the alternative hypothesis may be true.

Parametric / Non-parametric
Parametric statistics: statistical techniques that require the data to have certain characteristics (approximately normally distributed, interval/ratio scale of measurement).
Non-parametric statistics: statistical techniques designed for use when the data do not conform to the characteristics required for parametric tests. Non-parametric statistics are also known as distribution-free statistics. Examples include the Mann-Whitney U test, the Kruskal-Wallis test, and Wilcoxon's (T) test. In general, parametric tests are more robust, more complicated to compute, and have greater power efficiency.

Population / Sample
The population is the group of persons or objects that the researcher is interested in studying. To generalize about a population, the researcher studies a sample that is representative of the population.

Power
The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is false (i.e., that it will not make a Type II error, or false negative decision). As power increases, the chance of a Type II error decreases. Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a non-parametric test of the same hypothesis.

p-value
The p-value stands for probability value and represents the likelihood that a result is due to chance alone. More specifically, the p-value is the probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true. For instance, given a p-value of 0.05 (1/20) and repeated experiments, we would expect approximately one in every 20 replications of the experiment to show a relationship between the variables equal to or more extreme than the one found. The p-value is compared with the alpha level set by the researcher (usually .05) to determine whether the result is statistically significant. If the p-value is less than the alpha level, the result is significant and the null hypothesis is rejected.
If the p-value is greater than the alpha level, the result is non-significant and the researcher fails to reject the null hypothesis. When interpreting the p-value, it is important to understand the measurement as well as the practical significance of the results. The p-value indicates significance but does not reveal the size of the effect. In addition, a non-significant p-value does not necessarily mean that there is no association; rather, the non-significant result could be due to a lack of power to detect an association. In clinical trials, the level of statistical significance depends on the number of participants studied and the observations made, as well as the magnitude of the differences observed.

Skewed distribution and other distribution shapes (bimodal, J-shaped)
Skewed distribution. A skewed distribution is a distribution of scores or measures that produces a nonsymmetrical curve when plotted on a graph. The distribution may be positively skewed (infrequent scores on the high, or right, side of the distribution) or negatively skewed (infrequent scores on the low, or left, side of the distribution). The mean, mode, and median are not equal in a skewed distribution.
Bimodal. A bimodal distribution is a distribution that has two modes: two values that both occur with the highest frequency in the distribution. This distribution looks like it has two peaks, where the data center on the two values more frequently than on neighboring values.
J-shaped. A J-shaped distribution occurs when one of the first values on either end of the distribution occurs with the greatest frequency, with the following values occurring less and less frequently, so that the distribution is extremely asymmetrical and roughly resembles a J lying on its side.

Standard error
Standard error (SE) is a measure of the extent to which the sample mean deviates from the population mean. Another name for the standard error is the standard error of the mean (SEM), a name that gives more insight into the statistic: the standard error is the standard deviation of the means of multiple samples drawn from the same population. The standard error can be thought of as an index of how well the sample reflects the population. The smaller the standard error, the more the sampling distribution resembles the population.

z-score
The z-score (also known as the standard score) is the statistic of the standard normal distribution, which has a mean of zero and a standard deviation of 1. Raw scores can be standardized into z-scores (thus also known as standard scores). The z-score measures the location of a raw score by its distance from the mean in standard deviation units. Since the mean of the standard normal distribution is zero, a z-score of 1 reflects a raw score that falls one standard deviation above the mean. In the same manner, a z-score of -1 reflects a raw score that falls exactly one standard deviation below the mean. For standardized IQ scores (raw mean = 100, SD = 15), for example, a z-score of 1 would reflect a raw score of 115 and a z-score of -1 would reflect a raw score of 85.
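A quick check of the IQ example above, where z = (x - mean) / SD:

```python
# Standardized IQ scores: mean = 100, SD = 15.
def z_score(x, mean, sd):
    return (x - mean) / sd

print(z_score(115, 100, 15))   # 1.0, one SD above the mean
print(z_score(85, 100, 15))    # -1.0, one SD below the mean
```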

Regression Models & SEM

Bayesian linear regression
Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. When the regression model has errors with a normal distribution, and a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model's parameters. Bayesian methods can be used for any probability distribution.

Bootstrapped estimates
Bootstrapped estimates assume the sample is representative of the universe and do not make parametric assumptions about the data.

Canonical correlation
A canonical correlation is the correlation of two canonical (latent) variables, one representing a set of independent variables, the other a set of dependent variables. Canonical correlation is used for many-to-many relationships. There may be more than one such linear correlation relating the two sets of variables, with each such correlation representing a different dimension by which the independent set of variables is related to the dependent set. The purpose of canonical correlation is to explain the relation of the two sets of variables, not to model the individual variables.

Categorical regression
The goal of categorical regression is to describe the relationship between a response variable and a set of predictors. It is a variant of regression that can handle nominal independent variables, but it has now largely been replaced by generalized linear models. Scale values are assigned to each category of every variable such that these values are optimal with respect to the regression.

Cox regression
Cox regression may be used to analyze time-to-event data as well as proximity and preference data. Cox regression is designed for analysis of time until an event or time between events. The classic univariate example is time from diagnosis with a terminal illness until the event of death (hence, survival analysis). The central statistical output is the hazard ratio.

Curve estimation
Curve estimation lets the researcher explore how linear regression compares to any of 10 nonlinear models for the case of one independent variable predicting one dependent, and is thus useful for exploring which procedures and models may be appropriate for relationships in one's data. Curve fitting compares linear, logarithmic, inverse, quadratic, cubic, power, compound, S-curve, logistic, growth, and exponential models based on their relative goodness of fit for models in which a single dependent variable is predicted by a single independent variable or by a time variable.

Discriminant function analysis
Discriminant function analysis is used when the dependent variable is a dichotomy but the other assumptions of multiple regression can be met, making it more powerful than logistic regression for binary or multinomial dependents. Discriminant function analysis (also called discriminant analysis or DA) is used to classify cases into the values of a categorical dependent, usually a dichotomy. If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct.

Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical dependent that has more than two categories, using a number of interval or dummy independent variables as predictors. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis.

Dummy coding
Dummy variables are a way of adding the values of a nominal or ordinal variable to a regression equation. The standard approach to modeling categorical variables is to include them in the regression equation by converting each level of each categorical variable into a variable of its own, usually coded 0 or 1. For instance, the categorical variable "region" may be converted into dummy variables such as "East," "West," "North," and "South." Typically, "1" means the attribute of interest is present (e.g., South = 1 means the case is from the South). Of course, once the conversion is made, if we know a case's value on all the levels of a categorical variable except one, that last one is determined. We have to leave one of the levels out of the regression model to avoid perfect multicollinearity (singularity; redundancy), which would prevent a solution (for example, we may leave out "North" to avoid singularity). The omitted category is the reference category, because b coefficients must be interpreted with reference to it. (A coding sketch follows the entry-term definitions below.)
The interpretation of b coefficients is different when dummy variables are present. Normally, without dummy variables, the b coefficient is the amount the dependent variable increases when the independent variable associated with the b increases by one unit. When using a dummy variable such as "region" in the example above, the b coefficient is how much the dependent variable increases (or decreases, if b is negative) when the dummy variable increases by one unit (shifting from 0 = not present to 1 = present, e.g., South = 1 = case is from the South) compared to the reference category (North, in our example). Thus, for the set of dummy variables for "region," assuming "North" is the reference category and education level is the dependent, a b of -1.5 for the dummy "South" means that the expected education level for the South is 1.5 years less than the average of "North" respondents.

Entry terms: Forward / Backward / Stepwise / Blocking / Hierarchical
Forward selection starts with the constant-only model and adds variables one at a time, in the order in which they are best by some criterion, until some cutoff level is reached (e.g., until the step at which all variables not in the model have a significance higher than .05).
Backward selection starts with all variables and deletes them one at a time, in the order in which they are worst by some criterion.
Stepwise multiple regression is a way of computing OLS regression in stages. In stage one, the independent variable best correlated with the dependent is included in the equation. In the second stage, the remaining independent with the highest partial correlation with the dependent, controlling for the first independent, is entered. This process is repeated, at each stage partialing for previously entered independents, until the addition of a remaining independent does not increase R-squared by a significant amount (or until all variables are entered, of course).
Alternatively, the process can work backward, starting with all variables and eliminating independents one at a time until the elimination of one makes a significant difference in R-squared.
Hierarchical multiple regression (not to be confused with hierarchical linear models) is similar to stepwise regression, but the researcher, not the computer, determines the order of entry of the variables. F-tests are used to compute the significance of each added variable (or set of variables) to the explanation reflected in R-square. This hierarchical procedure is an alternative to comparing betas for purposes of assessing the importance of the independents. In more complex forms of hierarchical regression, the model may involve a series of intermediate variables that are dependents with respect to some other independents but are themselves independents with respect to the ultimate dependent. Hierarchical multiple regression may then involve a series of regressions for each intermediate as well as for the ultimate dependent.
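Returning to dummy coding, a small pandas sketch (values hypothetical); here "North" is dropped explicitly so that it serves as the reference category, matching the example above.

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East", "West", "South"],
                   "education": [14, 12, 13, 15, 11]})

# Each level becomes its own 0/1 variable; one level must be omitted as the
# reference category to avoid perfect multicollinearity (here, "North").
dummies = pd.get_dummies(df["region"], prefix="region")
X = pd.concat([df[["education"]], dummies.drop(columns="region_North")], axis=1)
print(X)
```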

Exogenous and endogenous variables
Exogenous variables in a path model are those with no explicit causes (no arrows going to them, other than the measurement error term). If exogenous variables are correlated, this is indicated by a double-headed arrow connecting them. Endogenous variables, then, are those that do have incoming arrows. Endogenous variables include intervening causal variables and dependents. Intervening endogenous variables have both incoming and outgoing causal arrows in the path diagram. The dependent variable(s) have only incoming arrows.

Factor analysis
Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. It reduces the attribute space from a larger number of variables to a smaller number of factors and does not assume that a dependent variable is specified.
Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis. There is no prior theory, and one uses factor loadings to intuit the factor structure of the data.
Confirmatory factor analysis (CFA) seeks to determine whether the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory, and factor analysis is used to see whether they load as predicted on the expected number of factors. The researcher's a priori assumption is that each factor (the number and labels of which may be specified a priori) is associated with a specified subset of indicator variables. A minimum requirement of confirmatory factor analysis is that one hypothesize the number of factors in the model beforehand, but usually the researcher will also posit expectations about which variables will load on which factors (Kim and Mueller, 1978b: 55). The researcher seeks to determine, for instance, whether measures created to represent a latent variable really belong together.
There are several different types of factor analysis, the most common being principal components analysis (PCA), which is preferred for purposes of data reduction. However, common factor analysis is preferred for purposes of causal analysis and for confirmatory factor analysis in structural equation modeling, among other settings.

Generalized least squares
Generalized least squares (GLS) is an adaptation of OLS that minimizes the sum of the differences between observed and predicted covariances rather than between estimates and scores. GLS works well even for nonnormal data when samples are large (n > 2500).

General linear model (multivariate)
Although regression models may be run easily in GLM, as a practical matter univariate GLM is used primarily to run analysis of variance (ANOVA) and analysis of covariance (ANCOVA) models. Multivariate GLM is used primarily to run multiple analysis of variance (MANOVA) and multiple analysis of covariance (MANCOVA) models. Multiple regression with just covariates (and/or with dummy variables) yields the same inferences as multiple analysis of variance (MANOVA), to which it is statistically equivalent. GLM can implement regression models with multiple dependents.
Generalized linear models / Generalized estimating equations
GZLM/GEE are the generalization of linear modeling to a form covering almost any dependent distribution with almost any link function, thus supporting linear regression, Poisson regression, gamma regression, and many others. GZLM covers variance and regression models that analyze normally distributed dependent variables using an identity link function (that is, prediction is directly of the values of the dependent).

Linear mixed models
Linear mixed models (LMM) handle data where observations are not independent. That is, LMM correctly models correlated errors, whereas procedures in the general linear model family (GLM) usually do not. (GLM includes such procedures as t-tests, analysis of variance, correlation, regression, and factor analysis, to name a few.) LMM is a further generalization of GLM that better supports analysis of a continuous dependent for:
1. Random effects: where the set of values of a categorical predictor variable is seen not as the complete set but rather as a random sample of all values (e.g., the variable "product" has values representing only 5 of a possible 42 brands). Through random effects models, the researcher can make inferences over a wider population in LMM than is possible with GLM.
2. Hierarchical effects: where predictor variables are measured at more than one level (e.g., reading achievement scores at the student level and teacher-student ratios at the school level).
3. Repeated measures: where observations are correlated rather than independent (e.g., before-after studies, time series data, matched-pairs designs).
LMM uses maximum likelihood estimation to estimate these parameters and supports more variations and data options. Hierarchical models in SPSS require LMM implementation. Linear mixed models include a variety of multi-level modeling (MLM) approaches, including hierarchical linear models, random coefficients (RC) models, and covariance components models. Note that multi-level mixed models are based on a multi-level theory which specifies expected direct effects of variables on each other within any one level, and which specifies cross-level interaction effects between variables located at different levels. That is, the researcher must postulate mediating mechanisms that cause variables at one level to influence variables at another level (e.g., school-level funding may positively affect individual-level student performance by way of recruiting superior teachers, made possible by superior financial incentives). Multi-level modeling tests multi-level theories statistically, simultaneously modeling variables at different levels without necessary recourse to aggregation or disaggregation. It should be noted, though, that in practice some variables may represent aggregated scores.

Logistic regression / odds ratio
Logistic regression. Logistic regression is a form of regression used with dichotomous dependent variables (usually scored 0, 1) and continuous and/or categorical independent variables. It is usually used for predicting whether something will happen or not, for instance, pass/fail, heart disease, or anything else that can be expressed as an event or non-event. Logistic regression works with the natural logarithm of the odds (the logit) to reduce nonlinearity. The technique estimates the odds of an event occurring by calculating changes in the log odds of the dependent variable. Logistic regression does not assume linear relationships between the independent and dependent variables, does not require normally distributed variables, and does not assume homoscedasticity. However, the observations must be independent, and the independent variables must be linearly related to the logit of the dependent variable.
Odds ratios. An odds ratio is the ratio of two odds. An odds ratio of 1.0 indicates that the independent has no effect on the dependent and that the variables are statistically independent.
An odds ratio greater than 1 indicates that the independent variable increases the likelihood of the event. The "event" depends on the coding of the dependent variable; typically, the dependent variable is coded as 0 or 1, with 1 representing the event of interest. Therefore, in binomial logistic regression, a unit increase in the independent variable is associated with an increase in the odds that the dependent equals 1. An odds ratio less than 1 indicates that the independent variable decreases the likelihood of the event; that is, a unit increase in the independent variable is associated with a decrease in the odds of the dependent being 1.


More information

Study Guide for the Final Exam

Study Guide for the Final Exam Study Guide for the Final Exam When studying, remember that the computational portion of the exam will only involve new material (covered after the second midterm), that material from Exam 1 will make

More information

Statistical tests for SPSS

Statistical tests for SPSS Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Data analysis process

Data analysis process Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis

More information

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES SCHOOL OF HEALTH AND HUMAN SCIENCES Using SPSS Topics addressed today: 1. Differences between groups 2. Graphing Use the s4data.sav file for the first part of this session. DON T FORGET TO RECODE YOUR

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

Descriptive Statistics and Measurement Scales

Descriptive Statistics and Measurement Scales Descriptive Statistics 1 Descriptive Statistics and Measurement Scales Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample

More information

Multivariate Analysis of Variance. The general purpose of multivariate analysis of variance (MANOVA) is to determine

Multivariate Analysis of Variance. The general purpose of multivariate analysis of variance (MANOVA) is to determine 2 - Manova 4.3.05 25 Multivariate Analysis of Variance What Multivariate Analysis of Variance is The general purpose of multivariate analysis of variance (MANOVA) is to determine whether multiple levels

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011

SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 Statistical techniques to be covered Explore relationships among variables Correlation Regression/Multiple regression Logistic regression Factor analysis

More information

Module 5: Multiple Regression Analysis

Module 5: Multiple Regression Analysis Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

The Dummy s Guide to Data Analysis Using SPSS

The Dummy s Guide to Data Analysis Using SPSS The Dummy s Guide to Data Analysis Using SPSS Mathematics 57 Scripps College Amy Gamble April, 2001 Amy Gamble 4/30/01 All Rights Rerserved TABLE OF CONTENTS PAGE Helpful Hints for All Tests...1 Tests

More information

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Descriptive Statistics Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Statistics as a Tool for LIS Research Importance of statistics in research

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

When to Use a Particular Statistical Test

When to Use a Particular Statistical Test When to Use a Particular Statistical Test Central Tendency Univariate Descriptive Mode the most commonly occurring value 6 people with ages 21, 22, 21, 23, 19, 21 - mode = 21 Median the center value the

More information

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY 1. Introduction Besides arriving at an appropriate expression of an average or consensus value for observations of a population, it is important to

More information

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

More information

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013 Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives

More information

Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

More information

Canonical Correlation Analysis

Canonical Correlation Analysis Canonical Correlation Analysis LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the similarities and differences between multiple regression, factor analysis,

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS This booklet contains lecture notes for the nonparametric work in the QM course. This booklet may be online at http://users.ox.ac.uk/~grafen/qmnotes/index.html.

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data.

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data. Chapter 15 Mixed Models A flexible approach to correlated data. 15.1 Overview Correlated data arise frequently in statistical analyses. This may be due to grouping of subjects, e.g., students within classrooms,

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

More information

Multivariate Analysis of Variance (MANOVA)

Multivariate Analysis of Variance (MANOVA) Chapter 415 Multivariate Analysis of Variance (MANOVA) Introduction Multivariate analysis of variance (MANOVA) is an extension of common analysis of variance (ANOVA). In ANOVA, differences among various

More information

Principles of Hypothesis Testing for Public Health

Principles of Hypothesis Testing for Public Health Principles of Hypothesis Testing for Public Health Laura Lee Johnson, Ph.D. Statistician National Center for Complementary and Alternative Medicine johnslau@mail.nih.gov Fall 2011 Answers to Questions

More information

Projects Involving Statistics (& SPSS)

Projects Involving Statistics (& SPSS) Projects Involving Statistics (& SPSS) Academic Skills Advice Starting a project which involves using statistics can feel confusing as there seems to be many different things you can do (charts, graphs,

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

More information

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

More information

Multiple Regression: What Is It?

Multiple Regression: What Is It? Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Moderation. Moderation

Moderation. Moderation Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Descriptive Analysis

Descriptive Analysis Research Methods William G. Zikmund Basic Data Analysis: Descriptive Statistics Descriptive Analysis The transformation of raw data into a form that will make them easy to understand and interpret; rearranging,

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test Bivariate Statistics Session 2: Measuring Associations Chi-Square Test Features Of The Chi-Square Statistic The chi-square test is non-parametric. That is, it makes no assumptions about the distribution

More information

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.

More information

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 seema@iasri.res.in 1. Descriptive Statistics Statistics

More information

How To Write A Data Analysis

How To Write A Data Analysis Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction

More information

Multivariate Analysis. Overview

Multivariate Analysis. Overview Multivariate Analysis Overview Introduction Multivariate thinking Body of thought processes that illuminate the interrelatedness between and within sets of variables. The essence of multivariate thinking

More information

Foundation of Quantitative Data Analysis

Foundation of Quantitative Data Analysis Foundation of Quantitative Data Analysis Part 1: Data manipulation and descriptive statistics with SPSS/Excel HSRS #10 - October 17, 2013 Reference : A. Aczel, Complete Business Statistics. Chapters 1

More information

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing. Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

More information

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

More information

The Statistics Tutor s Quick Guide to

The Statistics Tutor s Quick Guide to statstutor community project encouraging academics to share statistics support resources All stcp resources are released under a Creative Commons licence The Statistics Tutor s Quick Guide to Stcp-marshallowen-7

More information

The Chi-Square Test. STAT E-50 Introduction to Statistics

The Chi-Square Test. STAT E-50 Introduction to Statistics STAT -50 Introduction to Statistics The Chi-Square Test The Chi-square test is a nonparametric test that is used to compare experimental results with theoretical models. That is, we will be comparing observed

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

DISCRIMINANT FUNCTION ANALYSIS (DA)

DISCRIMINANT FUNCTION ANALYSIS (DA) DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

THE KRUSKAL WALLLIS TEST

THE KRUSKAL WALLLIS TEST THE KRUSKAL WALLLIS TEST TEODORA H. MEHOTCHEVA Wednesday, 23 rd April 08 THE KRUSKAL-WALLIS TEST: The non-parametric alternative to ANOVA: testing for difference between several independent groups 2 NON

More information

Examining Differences (Comparing Groups) using SPSS Inferential statistics (Part I) Dwayne Devonish

Examining Differences (Comparing Groups) using SPSS Inferential statistics (Part I) Dwayne Devonish Examining Differences (Comparing Groups) using SPSS Inferential statistics (Part I) Dwayne Devonish Statistics Statistics are quantitative methods of describing, analysing, and drawing inferences (conclusions)

More information

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics Analysis of Data Claudia J. Stanny PSY 67 Research Design Organizing Data Files in SPSS All data for one subject entered on the same line Identification data Between-subjects manipulations: variable to

More information

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate 1 One-Way ANOVA using SPSS 11.0 This section covers steps for testing the difference between three or more group means using the SPSS ANOVA procedures found in the Compare Means analyses. Specifically,

More information

Research Methods & Experimental Design

Research Methods & Experimental Design Research Methods & Experimental Design 16.422 Human Supervisory Control April 2004 Research Methods Qualitative vs. quantitative Understanding the relationship between objectives (research question) and

More information

UNIVERSITY OF NAIROBI

UNIVERSITY OF NAIROBI UNIVERSITY OF NAIROBI MASTERS IN PROJECT PLANNING AND MANAGEMENT NAME: SARU CAROLYNN ELIZABETH REGISTRATION NO: L50/61646/2013 COURSE CODE: LDP 603 COURSE TITLE: RESEARCH METHODS LECTURER: GAKUU CHRISTOPHER

More information

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the

More information

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS About Omega Statistics Private practice consultancy based in Southern California, Medical and Clinical

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

More information

Instructions for SPSS 21

Instructions for SPSS 21 1 Instructions for SPSS 21 1 Introduction... 2 1.1 Opening the SPSS program... 2 1.2 General... 2 2 Data inputting and processing... 2 2.1 Manual input and data processing... 2 2.2 Saving data... 3 2.3

More information

Basic Concepts in Research and Data Analysis

Basic Concepts in Research and Data Analysis Basic Concepts in Research and Data Analysis Introduction: A Common Language for Researchers...2 Steps to Follow When Conducting Research...3 The Research Question... 3 The Hypothesis... 4 Defining the

More information

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t

More information