Multiple Regression
Kean University
February 12, 2013


Contents
1. Multiple Linear Regression
2. Logistic Regression
3. Statistical Definitions
4. Regression Models & SEM

Multiple Linear Regression

Regression techniques are primarily used to create an equation that can be used to predict values of the dependent variable for all members of the population. A secondary function of regression is that it can be used as a means of explaining relationships between variables.

Types of Linear Regression

Standard Multiple Regression: all independent variables are entered into the analysis simultaneously.

Sequential Multiple Regression (Hierarchical Multiple Regression): independent variables are entered into the equation in a particular order, as decided by the researcher.

Stepwise Multiple Regression: typically used as an exploratory analysis, with large sets of predictors.
1. Forward Selection: bivariate correlations between all the IVs and the DV are calculated, and IVs are entered into the equation from the strongest correlate to the weakest.
2. Stepwise Selection: similar to forward selection; however, if an IV no longer appears to contribute much to the equation in combination with the other predictors, it is removed.
3. Backward Deletion: all IVs are entered into the equation. A partial F test is calculated for each variable as if it were entered last, to determine its level of contribution to the overall prediction. The variable with the smallest partial F is removed, based on a predetermined criterion.

Variables

IV: also referred to as predictor variables; one or more continuous variables.
DV: also referred to as the outcome variable; a single continuous variable.

Assumptions that must be met:
1. Normality. All errors should be normally distributed, which can be checked by looking at the skewness, kurtosis, and histogram plots. Technically, normality is necessary only for the t-tests to be valid; estimation of the coefficients requires only that the errors be identically and independently distributed.
2. Independence. The errors associated with one observation are not correlated with the errors of any other observation.
3. Linearity. The relationship between the IVs and the DV should be linear.
4. Homoscedasticity. The variance of the residuals should be constant across all levels of the IVs, which can be checked by plotting the residuals.
5. Model specification. The model should be properly specified, including all relevant variables and excluding irrelevant variables.

Other important issues:

Influence: individual observations that exert undue influence on the coefficients. Are there covariates that you should be including in your model?

Collinearity: the predictor variables should be related to the outcome, but not so strongly correlated with one another that they measure the same thing (e.g., using both age and grade), which will lead to multicollinearity. Multicollinearity misleadingly inflates the standard errors, making some variables appear statistically insignificant when they should be significant.
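To make the assumption checks concrete, here is a minimal illustrative sketch outside of SPSS, using Python's statsmodels; the data and variable names (entry_age, days_in_placement, num_offenses) are synthetic stand-ins loosely echoing the offenses example used later, not the workshop data.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

# Synthetic stand-in data; in practice you would load your own dataset.
rng = np.random.default_rng(0)
n = 226
entry_age = rng.uniform(10, 18, n)
days_in_placement = rng.uniform(0, 365, n)
num_offenses = 1.5 + 0.2 * entry_age + rng.normal(0, 2, n)

X = sm.add_constant(np.column_stack([entry_age, days_in_placement]))
model = sm.OLS(num_offenses, X).fit()
print(model.summary())                    # coefficients, R-squared, F test

resid = model.resid
print("skewness:", stats.skew(resid))     # rough normality check (assumption 1)
print("kurtosis:", stats.kurtosis(resid))
```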

Unusual and Influential Data

A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis. If a single observation (or a small group of observations) substantially changes your results, you would want to know about this and investigate further. There are three ways that an observation can be unusual.

Outliers: in linear regression, an outlier is an observation with a large residual; in other words, an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.

Leverage: an observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. High-leverage points can have an unusually large effect on the estimates of the regression coefficients.

Influence: an observation is said to be influential if removing it substantially changes the estimates of the coefficients. Influence can be thought of as the product of leverage and outlierness.

Collinearity Diagnostics

VIF: formally, variance inflation factors (VIF) measure how much the variances of the estimated coefficients are increased over the case of no correlation among the X variables. If no two X variables are correlated, all of the VIFs will be 1. If two or more variables have a VIF around or greater than 5 (some say up to 10 is acceptable), one of these variables should be removed from the regression model. To determine the best one to remove, remove each one individually and select the regression equation that explains the most variance (the highest R²).

Tolerance: the value should be greater than .10; a value less than .10 indicates a collinearity issue.

Other informal signs of multicollinearity:
- Regression coefficients change drastically when adding or deleting an X variable.
- A regression coefficient is negative when theoretically Y should increase with increasing values of that X variable, or positive when theoretically Y should decrease with increasing values of that X variable.
- None of the individual coefficients has a significant t statistic, but the overall F test for fit is significant.
- A regression coefficient has a nonsignificant t statistic even though, on theoretical grounds, that X variable should provide substantial information about Y.
- High pairwise correlations between the X variables. (But three or more X variables can be multicollinear together without having high pairwise correlations.)
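As an illustration of the VIF and tolerance diagnostics above, a short Python sketch using statsmodels; the deliberately collinear age/grade pair is made up to mirror the age-and-grade example.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 226
age = rng.uniform(10, 18, n)
grade = age - 6 + rng.normal(0, 0.3, n)    # deliberately collinear with age
days = rng.uniform(0, 365, n)

X = sm.add_constant(np.column_stack([age, grade, days]))
for i, name in enumerate(["const", "age", "grade", "days"]):
    if name == "const":
        continue
    vif = variance_inflation_factor(X, i)
    print(f"{name}: VIF = {vif:.1f}, tolerance = {1 / vif:.3f}")
```

Note that tolerance is simply 1/VIF, so the two diagnostics always agree.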

How to deal with multicollinearity:
- Increasing the sample size is a common first step, but it only partially offsets the problem.
- The easiest solution: remove the most intercorrelated variable(s) from the analysis. This method is misguided if the variables were included on theoretical grounds.
- Combine variables into a composite variable by building an index. Remember: in order to create an index, you need theoretical and empirical reasons to justify this action.
- Use centering: transform the offending independent variables by subtracting the mean from each case.
- Drop the intercorrelated variables from the analysis but substitute their crossproduct as an interaction term, or combine the intercorrelated variables in some other way. This is equivalent to respecifying the model by conceptualizing the correlated variables as indicators of a single latent variable. Note: if a correlated variable is a dummy variable, the other dummies in that set should also be included in the combined variable in order to keep the set of dummies conceptually together.
- Leave one intercorrelated variable as is, but then remove the variance in its covariates by regressing them on that variable and using the residuals.
- Assign the common variance to each of the covariates by some (probably arbitrary) procedure.
- Treat the common variance as a separate variable and decontaminate each covariate by regressing it on the others and using the residuals. That is, analyze the common variance as a separate variable.

To Conduct the Analysis in SPSS

1. With your dataset open, click Analyze, then Regression, then Linear.
2. The Linear Regression window will open. Select the outcome variable (DV) and click the right arrow to move it into the Dependent box. Then highlight all of the independent variables (predictors) and click the right arrow to move them into the Independent(s) box.
3. Select the method of regression that is most appropriate for the data set (a sketch of the selection logic follows these steps):
a. Enter: enters all IVs, one at a time, into the model regardless of significant contribution.
b. Stepwise: combines the Forward and Backward methods, using criteria for both entering and removing IVs from the equation.
c. Remove: first uses the Enter method; the specified variable(s) are then removed from the model and the Enter method is repeated.
d. Backward: enters all IVs one at a time and then removes them one at a time, based on a predetermined level of significance for removal (the SPSS default is p = .10).
e. Forward: enters only IVs that significantly contribute to the model.
4. Click the Statistics button to open the Statistics dialogue box. Check the appropriate statistics, which usually include Estimates, Model fit, Descriptives, Part and partial correlations, and Collinearity diagnostics. Note: if running a stepwise regression, also check R squared change.
5. Click the Options button to open the Options dialogue box. Here you can change the inclusion and exclusion criteria (the probability or F value for entry or removal), depending on the method of regression used.
6. Optional: if needed, click the Plots button to add plots and histograms to the output. Clicking the Save button also gives options to save the residuals, etc.
7. To create a syntax file, simply click Paste.
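SPSS performs these entry methods internally; for intuition only, here is an illustrative sketch of forward selection by p-value in Python. The threshold and column names are hypothetical, and real stepwise criteria vary by package.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(df, dv, candidates, alpha_enter=0.05):
    """Enter the predictor with the best p-value at each step."""
    selected, remaining = [], list(candidates)
    while remaining:
        pvals = {}
        for var in remaining:
            X = sm.add_constant(df[selected + [var]])
            pvals[var] = sm.OLS(df[dv], X).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(226, 3)), columns=["age", "days", "race"])
df["offenses"] = 0.5 * df["age"] + rng.normal(size=226)
print(forward_select(df, "offenses", ["age", "days", "race"]))  # likely ['age']
```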

Output

Run the syntax, and the output should look similar to the examples below. The Variables Entered box shows which variables have been included in or excluded from the regression analysis, and the method by which they were entered. Depending on the method of regression used, certain variables may be removed for failing to meet predetermined criteria.

The Model Summary box outlines the overall fit of the model. R is the correlation between the variables, which should be the same as shown in the Correlations table. R Square indicates the amount of variance in the dependent variable that is explained by the predictor variables; use the adjusted R Square when you have more than one predictor (IV). In this case, the predictor variables account for 9.8% of the variance in number of offenses. The adjusted R Square is a more conservative estimate of variance explained that removes variability likely due to chance; however, this value is not often reported or interpreted.

The ANOVA table is used to test whether the model significantly predicts the outcome variable. In this example, the model does significantly predict the outcome variable, because p < .001. The F ratio and its significance tell the degree to which the model predicts the DV.

The Coefficients box notes the degree and significance of each predictor's effect on the outcome variable. In this example, only whether or not one is incarcerated and entry age are significant predictors. When conducting regression analyses, it may be useful to run multiple combinations of predictor variables and regression methods. The unstandardized B is the amount by which the outcome variable (DV) changes for a one-unit change in each predictor. Standardized betas express each predictor's effect in standard deviation units, allowing the predictors' relative contributions to be compared. The t statistic and its significance tell whether each predictor is significant.
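For readers replicating this outside SPSS, the same quantities can be read off a fitted statsmodels model; a minimal sketch on synthetic data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=(226, 2))
y = x @ np.array([0.4, 0.0]) + rng.normal(size=226)
model = sm.OLS(y, sm.add_constant(x)).fit()

print(f"R2 = {model.rsquared:.3f}, adjusted R2 = {model.rsquared_adj:.3f}")
print(f"F({model.df_model:.0f}, {model.df_resid:.0f}) = {model.fvalue:.2f}, "
      f"p = {model.f_pvalue:.4f}")
print(model.params)                  # unstandardized B
print(model.tvalues, model.pvalues)  # t statistics and significance
```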

Sample Write Up & Table

A multiple regression was also conducted to predict the number of offenses based on the available independent variables. The predictors included incarcerated (vs. not incarcerated), the age at first offense, the number of days in placement, and race. The overall model was significant, F(5, 220) = 4.80, p < .001, and accounted for 9.8% of the variance. The results indicated that incarceration and the age at first offense were significant predictors of the number of offenses committed (see Table 9). The number of days in placement and race were not significant predictors of the number of offenses committed. Incarceration (compared to non-incarceration) was associated with an increase in the number of offenses committed (Beta = .243, p < .01). In addition, controlling for the other predictors, as age at first offense increased, the number of offenses also increased (Beta = .187, p < .01).

Table 9
Multiple Regression Analyses of Incarceration Status, Age, Days in Placement, and Race on Number of Offenses (N = 226)

Predictor               Unstandardized B    SE    Beta    t    p
Incarcerated
Age at First Offense
Days in Placement
African American
Caucasian

Note. F(5, 220) = 4.80, p < .001, R² = .098

Logistic Regression

Logistic regression, which may be binary or multinomial, is a type of prediction analysis that predicts a dichotomous dependent variable based on a set of independent variables.

Variables

DV: one dichotomous dependent variable (e.g., alive/dead, married/single, purchase/not purchase).
IVs: one or more independent variables, which can be either continuous or categorical.

Assumptions that must be met:
1. Sample size. Reducing a continuous variable to a binary or categorical one loses information and attenuates effect sizes, reducing the power of the logistic procedure. Therefore, in many cases, a larger sample size is needed to ensure the power of the statistical procedure. It is recommended that the sample size be at least 30 times the number of parameters, or 10 cases per independent variable.
2. Meaningful coding. Logistic coefficients will be difficult to interpret if the variables are not coded meaningfully. The convention for binary logistic regression is to code the dependent class of greatest interest as 1 ("the event occurring") and the other class as 0 ("the event not occurring").
3. Proper specification of the model. This is particularly crucial; parameters may change in magnitude, and even direction, when variables are added to or removed from the model.
a. Inclusion of all relevant variables: if relevant variables are omitted, the common variance they share with included variables may be wrongly attributed to those variables, or the error term may be inflated.
b. Exclusion of all irrelevant variables: if causally irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to the irrelevant variables. The more the irrelevant variable(s) correlate with the other independents, the greater the standard errors of the regression coefficients for those independents.
4. Linearity. Logistic regression does not require linear relationships between the independent factors or covariates and the dependent, but it does assume a linear relationship between the independents and the log odds (logit) of the dependent.
a. Box-Tidwell transformation (test): add to the logistic model interaction terms that are the crossproduct of each independent times its natural logarithm. If these terms are significant, there is nonlinearity in the logit. This method is not sensitive to small nonlinearities. (A sketch of this check appears at the end of this section.)
5. No outliers. As in linear regression, outliers can affect results significantly.

Types of Logistic Regression

Binary Logistic Regression: treats all IVs as continuous covariates; categorical variables must be declared in SPSS.
Multinomial Logistic Regression: all IVs are explicitly entered as factors, and the reference category of the outcome variable must be set in SPSS.
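A minimal illustrative sketch of the Box-Tidwell check from assumption 4a, using Python's statsmodels on synthetic data; in SPSS you would instead add the crossproduct terms as covariates.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
age = rng.uniform(20, 60, 500)                       # must be strictly positive
y = rng.binomial(1, 1 / (1 + np.exp(-(0.05 * age - 2))))

X = sm.add_constant(pd.DataFrame({"age": age, "age_ln_age": age * np.log(age)}))
result = sm.Logit(y, X).fit(disp=0)

# A significant age * ln(age) term would signal nonlinearity in the logit.
print(result.pvalues["age_ln_age"])
```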

To Conduct the Analysis in SPSS

1. With your data set open, click Analyze, then Regression, then either Binary or Multinomial Logistic.
2. Move the DV into the Dependent box, and move the IVs into the Covariates box.
3. Select the method of regression that is most appropriate for the data set:
a. Enter: enters all IVs, one at a time, into the model regardless of significant contribution.
b. Stepwise: combines the Forward and Backward methods, using criteria for both entering and removing IVs from the equation.
c. Remove: first uses the Enter method; the specified variable(s) are then removed from the model and the Enter method is repeated.
d. Backward: enters all IVs one at a time and then removes them one at a time, based on a predetermined level of significance for removal (the SPSS default is p = .10).
e. Forward: enters only IVs that significantly contribute to the model.
4. Click the Options button and check the box next to CI for exp(B) to display the confidence intervals. Then click Continue.
5. Paste and run the syntax.
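For comparison, an illustrative Python sketch of the same kind of analysis: a binary logistic model with Exp(B) odds ratios and their 95% confidence intervals. The data are synthetic, with names loosely following the job-satisfaction example below.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 800
age = rng.uniform(20, 65, n)
educ = rng.uniform(8, 20, n)
prob = 1 / (1 + np.exp(-(1.5 - 0.02 * age - 0.04 * educ)))
not_satisfied = rng.binomial(1, prob)       # 1 = "not very satisfied"

X = sm.add_constant(pd.DataFrame({"age": age, "educ": educ}))
result = sm.Logit(not_satisfied, X).fit(disp=0)

odds = np.exp(result.params)                # Exp(B)
ci = np.exp(result.conf_int())              # 95% CI for Exp(B)
print(pd.DataFrame({"OR": odds, "lower": ci[0], "upper": ci[1],
                    "p": result.pvalues}))
```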

Output

Run the syntax, and the output should look similar to the examples below.

The first box outlines how many cases were included in and excluded from the analysis; report the n included in the analysis in your write-up. The Dependent Variable Encoding box shows the label for each coding. This is important to note, because SPSS creates the regression equation based on the likelihood of having a value of 1. In this case, SPSS is creating an equation to predict the likelihood that an individual is not very satisfied.

The next set of tables falls under the heading Block 0: Beginning Block and consists of three tables: Classification Table, Variables in the Equation, and Variables not in the Equation. This block provides a description of the null model and does not include the predictor variables. Do not report or interpret these values.

Block 1 is what is interpreted and reported in the write-up. The -2 Log likelihood is not interpreted or reported. The Omnibus Test uses a chi-square to determine whether the model is statistically significant. Cox & Snell R Square and Nagelkerke R Square are measures of effect size; typically, Nagelkerke is reported over Cox & Snell. Based on the regression equation created from the analysis, SPSS predicts which group individual cases will belong to, and then calculates the percentage of correct predictions.
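The omnibus chi-square and the pseudo-R² effect sizes that SPSS prints can also be computed by hand from a fitted model; a sketch using statsmodels on synthetic data, with the standard Cox & Snell and Nagelkerke formulas:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))
result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

llr = result.llr                                  # omnibus chi-square: 2*(LL1 - LL0)
print(f"chi2({result.df_model:.0f}) = {llr:.2f}, p = {result.llr_pvalue:.4f}")

n = result.nobs
cox_snell = 1 - np.exp(-llr / n)
nagelkerke = cox_snell / (1 - np.exp(2 * result.llnull / n))
print(f"Cox & Snell R2 = {cox_snell:.3f}, Nagelkerke R2 = {nagelkerke:.3f}")
```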

In the Variables in the Equation table, the estimated coefficient and standard error are reported but not interpreted. Report the significance of each predictor, and report both the lower and upper limits of the confidence interval. Exp(B) is the odds ratio for each predictor. As mentioned, SPSS is predicting the likelihood of the DV being a 1, in this case Not Very Satisfied. When the odds ratio is less than 1, increasing values of the variable correspond to decreasing odds of the event's occurrence. When the odds ratio is greater than 1, increasing values of the variable correspond to increasing odds of the event's occurrence.

Sample Write Up and Table

A logistic regression analysis was conducted to predict whether an individual was not very satisfied with his or her job (see Table 1). Overall, the model was significant, χ²(4) = 16.71, p = .002, Nagelkerke R² = .025. Of all the predictor variables, only age was a significant predictor, p < .001, with an odds ratio of .980, indicating that as an individual's age increases, he or she is less likely to be not very satisfied with his or her job. Years of education was a marginally significant predictor, p = .085, with an odds ratio of .957, indicating that as years of education increase, the likelihood of being not very satisfied with one's job decreases. None of the remaining predictors (e.g., hours worked per week, number of siblings) were significant predictors of job satisfaction, ns.

Table 1
Summary of Logistic Regression Predicting Job Satisfaction (Satisfied or Not Satisfied)

Predictor              β    Odds Ratio    95% CI Lower    95% CI Upper    p
Age
Years of Education
Hours per Week
Number of Siblings

Note. χ²(4) = 16.71, p = .002, Nagelkerke R² = .025

Statistical Definitions

Binary (dichotomous) variable
A binary variable has only two values, typically 0 or 1. Similarly, a dichotomous variable is a categorical variable with only two values. Examples include success or failure, male or female, and alive or dead.

Categorical variable
A variable that can be placed into separate categories based on some characteristic or attribute. Also referred to as qualitative, discrete, or nominal variables. Examples include gender, drug treatments, race or ethnicity, disease subtypes, and dosage level.

Causal relationship
A causal relationship is one in which a change in one variable can be attributed to a change in another variable. The study needs to be designed in a way that makes it legitimate to infer cause. In most cases, the term causal conclusion indicates findings from an experiment in which the subjects are randomly assigned to a control or experimental group. Causality cannot be determined from a correlational research design. Furthermore, it is important to note that a significant finding (small p-value) does not signify causality. The medical statistician Austin B. Hill outlined nine criteria for establishing causality in epidemiological research: temporal relationship, strength, dose-response relationship, plausibility, consideration of alternate explanations, experiment, specificity, and coherence.

Central Limit Theorem
The Central Limit Theorem is the foundation for many statistical techniques. The theorem states that the larger the sample size (> 30), the more closely the sampling distribution of the mean will approach a normal distribution. The mean of the sampling distribution of the mean will approach the true population mean, and its standard deviation will be σ/√n (the population standard deviation divided by the square root of n). The population from which the sample is drawn does not need to be normally distributed. Furthermore, the Central Limit Theorem explains why the approximation improves with larger samples, and why sampling error is smaller with larger samples than with smaller samples.

Confidence interval
A confidence interval is an interval estimate of a population parameter, consisting of a range of values bounded by upper and lower confidence limits. The parameter is estimated as falling somewhere between the two values. Researchers can assign a degree of confidence to the interval estimate (typically 90%, 95%, or 99%), indicating that the interval will include the population parameter that percentage of the time. The higher the confidence level, the wider the confidence interval.

Confounding variable
A confounding variable is one that obscures the effects of another variable. In other words, a confounding variable is associated with both the independent and dependent (outcome) variables, and therefore affects the results. Confounding variables are also called extraneous variables and are problematic because the researcher cannot be sure whether results are due to the independent variable, the confounding variable, or both. Smoking, for instance, would be a confounding variable in the relationship between drinking alcohol and lung cancer. Therefore, a researcher studying the relationship between alcohol consumption and lung cancer should control for the effects of smoking. A positive confounder is related to the independent and dependent variables in the same direction; a negative confounder displays an opposite relationship to the two variables.
If there is a confounding effect, researchers can use a stratified sample and/or a statistical model that controls for the confounding variable (e.g., multiple regression, analysis of covariance).
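A quick simulation makes the Central Limit Theorem above concrete: even for a markedly non-normal population, sample means center on the population mean with a standard deviation near σ/√n. All numbers here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=2.0, size=100_000)  # decidedly non-normal

n = 50
sample_means = [rng.choice(population, n).mean() for _ in range(5_000)]

print("population mean:", population.mean())
print("mean of sample means:", np.mean(sample_means))   # close to the above
print("SD of sample means:", np.std(sample_means))      # close to sigma/sqrt(n)
print("sigma / sqrt(n):", population.std() / np.sqrt(n))
```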

Continuous variable
A variable that can take on any value within the range over which the variable varies. Continuous variables are measured on ratio or interval scales. Examples include age, temperature, height, and weight.

Control group
In experimental research and many types of clinical trials, the control group is the group of participants that does not receive the treatment. The control group is used for comparison and is treated exactly like the experimental group, except that it does not receive the experimental treatment. In many clinical trials, one group of patients will be given an experimental drug or treatment, while the control group is given either a standard treatment for the illness or a placebo (e.g., a sugar pill).

Covariate
A covariate is a variable that is statistically controlled for using techniques such as multiple regression analysis or analysis of covariance. Covariates are also known as control variables and, in general, have a linear relationship with the dependent variable. Using covariates in analyses allows the researcher to produce more precise estimates of the effect of the independent variable of interest. To determine whether the use of a covariate is legitimate, examine the effect of the covariate on the residual (error) variance: if the covariate reduces the error, it is likely to improve the analysis.

Degrees of freedom
The degrees of freedom, usually abbreviated df, represent the number of values free to vary when calculating a statistic. For instance, the degrees of freedom in a 2x2 crosstab table are calculated by multiplying the number of rows minus 1 by the number of columns minus 1. Therefore, if the totals are fixed, only one of the four cell counts is free to vary, and df = (2-1)(2-1) = 1.

Dependent variable
The dependent variable is the effect of interest that is measured in the study. It is termed the dependent variable because it depends on another variable. Also referred to as the outcome or criterion variable.

Descriptive / Inferential statistics
Descriptive statistics. Descriptive statistics provide a summary of the available data. They are used to simplify large amounts of data by summarizing, organizing, and graphing quantitative information. Typical descriptive statistics include measures of central tendency (mean, median, mode) and measures of variability or spread (range, standard deviation, variance).
Inferential statistics. Inferential statistics allow researchers to draw conclusions or inferences from the data. Typically, inferential statistics are used to make inferences or claims about a population based on a sample drawn from that population. Examples include independent t tests and analysis of variance (ANOVA) techniques.

Effect size
An effect size is a measure of the strength of the relationship between two variables. Sample-based effect sizes are distinguished from the test statistics used in hypothesis testing in that they estimate the strength of an apparent relationship rather than assigning a significance level reflecting whether the relationship could be due to chance. The effect size does not determine the significance level, or vice versa. Some fields apply words such as "small," "medium," and "large" to the size of an effect. Whether an effect size should be interpreted as small, medium, or large depends on its substantive context and its operational definition.
Some common measures of effect size are Cohen's d, Cramér's V, odds ratios, standardized beta weights, Pearson's r, and partial eta squared.

Independent variable
Independent variables are typically controlled or manipulated by the researcher. Independent variables are also used to predict the values of another variable. Furthermore, researchers often use demographic variables (e.g., gender, race, age) as independent variables in statistical analysis. Examples of independent variables include the treatment given to groups, the dosage level of an experimental drug, gender, and race.

Measures of central tendency: mean, median, mode
Measures of central tendency summarize data using the value that is most typical or representative; they include the mean, median, and mode.
Mean. The mean (strictly speaking, the arithmetic mean) is also known as the average. It is calculated by adding up the values for each case and dividing by the total number of cases. It is often symbolized by M or X̄ ("X-bar"). The mean is influenced by outliers and should not be used with skewed distributions.
Median. The median is the central value of a set of values ranked in ascending (or descending) order. Since 50% of all scores fall at or below the 50th percentile, the median is the score located at the 50th percentile. The median is not influenced by extreme scores and is the preferred measure of central tendency for a skewed distribution.
Mode. The mode is the value that occurs most frequently in a set of scores. The mode is not influenced by extreme values.

Measures of dispersion: variance, standard deviation, range
Measures of dispersion are statistics that show the amount of variation or spread in the scores, or values, of a variable. Widely scattered or variable data result in large measures of dispersion, whereas tightly clustered data result in small measures of dispersion. Commonly used measures of dispersion include the variance and the standard deviation.
Variance. A measure of the amount of variability in a set of scores, calculated as the square of the standard deviation. Larger values for the variance indicate that individual cases are further from the mean and the distribution is wider; smaller variances indicate that individual cases are closer to the mean and the distribution is tighter. The population variance is symbolized by σ² and the sample variance by s².
Standard deviation. A measure of spread or dispersion in a set of scores; the standard deviation is the square root of the variance. As with the variance, the more widely the scores are spread out, the larger the standard deviation. Unlike the variance, which is expressed in squared units of measurement, the standard deviation is expressed in the same units as the original data. In the event that the standard deviation is greater than the mean, the mean would be deemed inappropriate as a representative measure of central tendency. The empirical rule states that for normal distributions, approximately 68% of the distribution falls within ±1 standard deviation of the mean, 95% within ±2 standard deviations, and 99.7% within ±3 standard deviations. The standard deviation is symbolized by SD or s.

Multivariate / Bivariate / Univariate
Multivariate. Quantitative methods for examining multiple variables at the same time. For instance, designs that involve two or more independent variables and two or more dependent variables would use multivariate analytic techniques.
Examples include multiple regression analysis, MANOVA, factor analysis, and discriminant analysis.
Bivariate. Quantitative methods that involve two variables.
Univariate. Methods that involve only one variable. Often used to refer to techniques in which there is only one outcome or dependent variable.
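A brief demonstration of the central tendency and dispersion measures defined above (the scores are made up; note how the outlier pulls the mean but not the median):

```python
import numpy as np
from statistics import mode

scores = [2, 3, 3, 4, 5, 6, 21]             # 21 is an outlier

print("mean:", np.mean(scores))              # pulled upward by the outlier
print("median:", np.median(scores))          # robust to the outlier
print("mode:", mode(scores))
print("range:", max(scores) - min(scores))
print("sample variance:", np.var(scores, ddof=1))
print("sample SD:", np.std(scores, ddof=1))  # square root of the variance
```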

Normal distribution
The normal distribution is a bell-shaped, theoretical continuous probability distribution. The horizontal axis represents all possible values of a variable, and the vertical axis represents the probability of those values. The scores on the variable are clustered around the mean in a symmetrical, unimodal fashion. The mean, median, and mode are all the same in the normal distribution. The normal distribution is widely used in statistical inference.

Null / Alternative hypotheses
Null hypothesis. In general, the null hypothesis (H0) is a statement of no effect. The null hypothesis is set up under the assumption that it is true, and is therefore tested for rejection.
Alternative hypothesis. The hypothesis alternative to the one being tested (i.e., the alternative to the null hypothesis). The alternative hypothesis is denoted by Ha or H1 and is also known as the research or experimental hypothesis. Rejecting the null hypothesis (on the basis of some statistical test) indicates that the alternative hypothesis may be true.

Parametric / Non-parametric
Parametric statistics: statistical techniques that require the data to have certain characteristics (approximately normally distributed, interval/ratio scale of measurement).
Non-parametric statistics: statistical techniques designed for use when the data do not conform to the characteristics required for parametric tests. Non-parametric statistics are also known as distribution-free statistics. Examples include the Mann-Whitney U test, the Kruskal-Wallis test, and Wilcoxon's (T) test. In general, parametric tests are more robust, more complicated to compute, and have greater power efficiency.

Population / Sample
The population is the group of persons or objects that the researcher is interested in studying. To generalize about a population, the researcher studies a sample that is representative of the population.

Power
The power of a statistical test is the probability that the test will reject the null hypothesis when the null hypothesis is false (i.e., that it will not make a Type II error, or false negative decision). As power increases, the chance of a Type II error decreases. Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a non-parametric test of the same hypothesis.

p-value
The p-value stands for probability value and represents the likelihood that a result is due to chance alone. More specifically, the p-value is the probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true. For instance, given a p-value of 0.05 (1/20) and repeated experiments, we would expect approximately one in every 20 replications of the experiment to show a relationship between the variables equal to or more extreme than the one found. The p-value is compared with the alpha level set by the researcher (usually .05) to determine whether the result is statistically significant. If the p-value is less than the alpha level, the result is significant and the null hypothesis is rejected.
If the p-value is greater than the alpha level, the result is non-significant and the researcher fails to reject the null hypothesis. When interpreting the p-value, it is important to understand the measurement as well as the practical significance of the results. The p-value indicates significance but does not reveal the size of the effect. In addition, a non-significant p-value does not necessarily mean that there is no association; rather, the non-significant result could be due to a lack of power to detect an association. In clinical trials, the level of statistical significance depends on the number of participants studied and the observations made, as well as the magnitude of the differences observed.

Skewed distribution and other distribution shapes (bimodal, J-shaped)
Skewed distribution. A skewed distribution is a distribution of scores or measures that produces a nonsymmetrical curve when plotted on a graph. The distribution may be positively skewed (infrequent scores on the high, or right, side of the distribution) or negatively skewed (infrequent scores on the low, or left, side of the distribution). The mean, mode, and median are not equal in a skewed distribution.
Bimodal. A bimodal distribution is a distribution that has two modes: two values that both occur with the highest frequency in the distribution. This distribution looks like it has two peaks, where the data center on the two values more frequently than on neighboring values.
J-shaped. A J-shaped distribution occurs when one of the first values on either end of the distribution occurs with the greatest frequency, with the following values occurring less and less frequently, so that the distribution is extremely asymmetrical and roughly resembles a J lying on its side.

Standard error
Standard error (SE) is a measure of the extent to which the sample mean deviates from the population mean. Another name for the standard error is the standard error of the mean (SEM), a name that gives more insight into the statistic: the standard error is the standard deviation of the means of multiple samples drawn from the same population. The standard error can be thought of as an index of how well the sample reflects the population. The smaller the standard error, the more the sampling distribution resembles the population.

z-score
The z-score (also known as the standard score) is the statistic of the standard normal distribution, which has a mean of zero and a standard deviation of 1. Raw scores can be standardized into z-scores (thus also known as standard scores). The z-score measures the location of a raw score by its distance from the mean in standard deviation units. Since the mean of the standard normal distribution is zero, a z-score of 1 reflects a raw score that falls one standard deviation above the mean. In the same manner, a z-score of -1 reflects a raw score that falls exactly one standard deviation below the mean. For standardized IQ scores (raw mean = 100, SD = 15), for example, a z-score of 1 would reflect a raw score of 115 and a z-score of -1 would reflect a raw score of 85.
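A quick check of the IQ example above, where z = (x - mean) / SD:

```python
# Standardized IQ scores: mean = 100, SD = 15.
def z_score(x, mean, sd):
    return (x - mean) / sd

print(z_score(115, 100, 15))   # 1.0, one SD above the mean
print(z_score(85, 100, 15))    # -1.0, one SD below the mean
```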

Regression Models & SEM

Bayesian linear regression
Bayesian linear regression is an approach to linear regression in which the statistical analysis is undertaken within the context of Bayesian inference. When the regression model has errors with a normal distribution, and a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model's parameters. Bayesian methods can be used for any probability distribution.

Bootstrapped estimates
Bootstrapped estimates assume the sample is representative of the universe and do not make parametric assumptions about the data.

Canonical correlation
A canonical correlation is the correlation of two canonical (latent) variables, one representing a set of independent variables, the other a set of dependent variables. Canonical correlation is used for many-to-many relationships. There may be more than one such linear correlation relating the two sets of variables, with each such correlation representing a different dimension by which the independent set of variables is related to the dependent set. The purpose of canonical correlation is to explain the relation of the two sets of variables, not to model the individual variables.

Categorical regression
The goal of categorical regression is to describe the relationship between a response variable and a set of predictors. It is a variant of regression that can handle nominal independent variables, but it has now largely been replaced by generalized linear models. Scale values are assigned to each category of every variable such that these values are optimal with respect to the regression.

Cox regression
Cox regression may be used to analyze time-to-event data as well as proximity and preference data. Cox regression is designed for analysis of time until an event or time between events. The classic univariate example is time from diagnosis with a terminal illness until the event of death (hence, survival analysis). The central statistical output is the hazard ratio.

Curve estimation
Curve estimation lets the researcher explore how linear regression compares to any of 10 nonlinear models for the case of one independent variable predicting one dependent, and is thus useful for exploring which procedures and models may be appropriate for relationships in one's data. Curve fitting compares linear, logarithmic, inverse, quadratic, cubic, power, compound, S-curve, logistic, growth, and exponential models based on their relative goodness of fit for models in which a single dependent variable is predicted by a single independent variable or by a time variable.

Discriminant function analysis
Discriminant function analysis is used when the dependent variable is a dichotomy but the other assumptions of multiple regression can be met, making it more powerful than logistic regression for binary or multinomial dependents. Discriminant function analysis (also called discriminant analysis or DA) is used to classify cases into the values of a categorical dependent, usually a dichotomy. If discriminant function analysis is effective for a set of data, the classification table of correct and incorrect estimates will yield a high percentage correct.

Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a cousin of multiple analysis of variance (MANOVA), sharing many of the same assumptions and tests. MDA is used to classify a categorical dependent that has more than two categories, using a number of interval or dummy independent variables as predictors. MDA is sometimes also called discriminant factor analysis or canonical discriminant analysis.

Dummy coding
Dummy variables are a way of adding the values of a nominal or ordinal variable to a regression equation. The standard approach to modeling categorical variables is to include them in the regression equation by converting each level of each categorical variable into a variable of its own, usually coded 0 or 1. For instance, the categorical variable "region" may be converted into dummy variables such as "East," "West," "North," and "South." Typically, "1" means the attribute of interest is present (e.g., South = 1 means the case is from the South). Of course, once the conversion is made, if we know a case's value on all the levels of a categorical variable except one, that last one is determined. We have to leave one of the levels out of the regression model to avoid perfect multicollinearity (singularity; redundancy), which would prevent a solution (for example, we may leave out "North" to avoid singularity). The omitted category is the reference category, because b coefficients must be interpreted with reference to it. (A coding sketch follows the entry-term definitions below.)
The interpretation of b coefficients is different when dummy variables are present. Normally, without dummy variables, the b coefficient is the amount the dependent variable increases when the independent variable associated with the b increases by one unit. When using a dummy variable such as "region" in the example above, the b coefficient is how much the dependent variable increases (or decreases, if b is negative) when the dummy variable increases by one unit (shifting from 0 = not present to 1 = present, e.g., South = 1 = case is from the South) compared to the reference category (North, in our example). Thus, for the set of dummy variables for "region," assuming "North" is the reference category and education level is the dependent, a b of -1.5 for the dummy "South" means that the expected education level for the South is 1.5 years less than the average of "North" respondents.

Entry terms: Forward / Backward / Stepwise / Blocking / Hierarchical
Forward selection starts with the constant-only model and adds variables one at a time, in the order in which they are best by some criterion, until some cutoff level is reached (e.g., until the step at which all variables not in the model have a significance higher than .05).
Backward selection starts with all variables and deletes them one at a time, in the order in which they are worst by some criterion.
Stepwise multiple regression is a way of computing OLS regression in stages. In stage one, the independent variable best correlated with the dependent is included in the equation. In the second stage, the remaining independent with the highest partial correlation with the dependent, controlling for the first independent, is entered. This process is repeated, at each stage partialing for previously entered independents, until the addition of a remaining independent does not increase R-squared by a significant amount (or until all variables are entered, of course).
Alternatively, the process can work backward, starting with all variables and eliminating independents one at a time until the elimination of one makes a significant difference in R-squared.
Hierarchical multiple regression (not to be confused with hierarchical linear models) is similar to stepwise regression, but the researcher, not the computer, determines the order of entry of the variables. F-tests are used to compute the significance of each added variable (or set of variables) to the explanation reflected in R-square. This hierarchical procedure is an alternative to comparing betas for purposes of assessing the importance of the independents. In more complex forms of hierarchical regression, the model may involve a series of intermediate variables that are dependents with respect to some other independents but are themselves independents with respect to the ultimate dependent. Hierarchical multiple regression may then involve a series of regressions for each intermediate as well as for the ultimate dependent.
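Returning to dummy coding, a small pandas sketch (values hypothetical); here "North" is dropped explicitly so that it serves as the reference category, matching the example above.

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "East", "West", "South"],
                   "education": [14, 12, 13, 15, 11]})

# Each level becomes its own 0/1 variable; one level must be omitted as the
# reference category to avoid perfect multicollinearity (here, "North").
dummies = pd.get_dummies(df["region"], prefix="region")
X = pd.concat([df[["education"]], dummies.drop(columns="region_North")], axis=1)
print(X)
```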

Exogenous and endogenous variables
Exogenous variables in a path model are those with no explicit causes (no arrows going to them, other than the measurement error term). If exogenous variables are correlated, this is indicated by a double-headed arrow connecting them. Endogenous variables, then, are those that do have incoming arrows. Endogenous variables include intervening causal variables and dependents. Intervening endogenous variables have both incoming and outgoing causal arrows in the path diagram. The dependent variable(s) have only incoming arrows.

Factor analysis
Factor analysis is used to uncover the latent structure (dimensions) of a set of variables. It reduces the attribute space from a larger number of variables to a smaller number of factors and does not assume that a dependent variable is specified.
Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis. There is no prior theory, and one uses factor loadings to intuit the factor structure of the data.
Confirmatory factor analysis (CFA) seeks to determine whether the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. Indicator variables are selected on the basis of prior theory, and factor analysis is used to see whether they load as predicted on the expected number of factors. The researcher's a priori assumption is that each factor (the number and labels of which may be specified a priori) is associated with a specified subset of indicator variables. A minimum requirement of confirmatory factor analysis is that one hypothesize the number of factors in the model beforehand, but usually the researcher will also posit expectations about which variables will load on which factors (Kim and Mueller, 1978b: 55). The researcher seeks to determine, for instance, whether measures created to represent a latent variable really belong together.
There are several different types of factor analysis, the most common being principal components analysis (PCA), which is preferred for purposes of data reduction. However, common factor analysis is preferred for purposes of causal analysis and for confirmatory factor analysis in structural equation modeling, among other settings.

Generalized least squares
Generalized least squares (GLS) is an adaptation of OLS that minimizes the sum of the differences between observed and predicted covariances rather than between estimates and scores. GLS works well even for nonnormal data when samples are large (n > 2500).

General linear model (multivariate)
Although regression models may be run easily in GLM, as a practical matter univariate GLM is used primarily to run analysis of variance (ANOVA) and analysis of covariance (ANCOVA) models. Multivariate GLM is used primarily to run multiple analysis of variance (MANOVA) and multiple analysis of covariance (MANCOVA) models. Multiple regression with just covariates (and/or with dummy variables) yields the same inferences as multiple analysis of variance (MANOVA), to which it is statistically equivalent. GLM can implement regression models with multiple dependents.
Generalized linear models / Generalized estimating equations
GZLM/GEE are the generalization of linear modeling to a form covering almost any dependent distribution with almost any link function, thus supporting linear regression, Poisson regression, gamma regression, and many others. GZLM covers variance and regression models that analyze normally distributed dependent variables using an identity link function (that is, prediction is directly of the values of the dependent).

Linear mixed models
Linear mixed models (LMM) handle data where observations are not independent. That is, LMM correctly models correlated errors, whereas procedures in the general linear model family (GLM) usually do not. (GLM includes such procedures as t-tests, analysis of variance, correlation, regression, and factor analysis, to name a few.) LMM is a further generalization of GLM that better supports analysis of a continuous dependent for:
1. Random effects: where the set of values of a categorical predictor variable is seen not as the complete set but rather as a random sample of all values (e.g., the variable "product" has values representing only 5 of a possible 42 brands). Through random effects models, the researcher can make inferences over a wider population in LMM than is possible with GLM.
2. Hierarchical effects: where predictor variables are measured at more than one level (e.g., reading achievement scores at the student level and teacher-student ratios at the school level).
3. Repeated measures: where observations are correlated rather than independent (e.g., before-after studies, time series data, matched-pairs designs).
LMM uses maximum likelihood estimation to estimate these parameters and supports more variations and data options. Hierarchical models in SPSS require LMM implementation. Linear mixed models include a variety of multi-level modeling (MLM) approaches, including hierarchical linear models, random coefficients (RC) models, and covariance components models. Note that multi-level mixed models are based on a multi-level theory which specifies expected direct effects of variables on each other within any one level, and which specifies cross-level interaction effects between variables located at different levels. That is, the researcher must postulate mediating mechanisms that cause variables at one level to influence variables at another level (e.g., school-level funding may positively affect individual-level student performance by way of recruiting superior teachers, made possible by superior financial incentives). Multi-level modeling tests multi-level theories statistically, simultaneously modeling variables at different levels without necessary recourse to aggregation or disaggregation. It should be noted, though, that in practice some variables may represent aggregated scores.

Logistic regression / odds ratio
Logistic regression. Logistic regression is a form of regression used with dichotomous dependent variables (usually scored 0, 1) and continuous and/or categorical independent variables. It is usually used for predicting whether something will happen or not, for instance, pass/fail, heart disease, or anything else that can be expressed as an event or non-event. Logistic regression works with the natural logarithm of the odds (the logit) to reduce nonlinearity. The technique estimates the odds of an event occurring by calculating changes in the log odds of the dependent variable. Logistic regression does not assume linear relationships between the independent and dependent variables, does not require normally distributed variables, and does not assume homoscedasticity. However, the observations must be independent, and the independent variables must be linearly related to the logit of the dependent variable.
Odds ratios. An odds ratio is the ratio of two odds. An odds ratio of 1.0 indicates that the independent has no effect on the dependent and that the variables are statistically independent.
An odds ratio greater than 1 indicates that the independent variable increases the likelihood of the event. The "event" depends on the coding of the dependent variable; typically, the dependent variable is coded as 0 or 1, with 1 representing the event of interest. Therefore, in binomial logistic regression, a unit increase in the independent variable is associated with an increase in the odds that the dependent equals 1. An odds ratio less than 1 indicates that the independent variable decreases the likelihood of the event; that is, a unit increase in the independent variable is associated with a decrease in the odds of the dependent being 1.


More information

Study Guide for the Final Exam

Study Guide for the Final Exam Study Guide for the Final Exam When studying, remember that the computational portion of the exam will only involve new material (covered after the second midterm), that material from Exam 1 will make

More information

Statistical tests for SPSS

Statistical tests for SPSS Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

Data analysis process

Data analysis process Data analysis process Data collection and preparation Collect data Prepare codebook Set up structure of data Enter data Screen data for errors Exploration of data Descriptive Statistics Graphs Analysis

More information

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES SCHOOL OF HEALTH AND HUMAN SCIENCES Using SPSS Topics addressed today: 1. Differences between groups 2. Graphing Use the s4data.sav file for the first part of this session. DON T FORGET TO RECODE YOUR

More information

Association Between Variables

Association Between Variables Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi

More information

Descriptive Statistics and Measurement Scales

Descriptive Statistics and Measurement Scales Descriptive Statistics 1 Descriptive Statistics and Measurement Scales Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample

More information

Multivariate Analysis of Variance. The general purpose of multivariate analysis of variance (MANOVA) is to determine

Multivariate Analysis of Variance. The general purpose of multivariate analysis of variance (MANOVA) is to determine 2 - Manova 4.3.05 25 Multivariate Analysis of Variance What Multivariate Analysis of Variance is The general purpose of multivariate analysis of variance (MANOVA) is to determine whether multiple levels

More information

Module 3: Correlation and Covariance

Module 3: Correlation and Covariance Using Statistical Data to Make Decisions Module 3: Correlation and Covariance Tom Ilvento Dr. Mugdim Pašiƒ University of Delaware Sarajevo Graduate School of Business O ften our interest in data analysis

More information

SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011

SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 SPSS ADVANCED ANALYSIS WENDIANN SETHI SPRING 2011 Statistical techniques to be covered Explore relationships among variables Correlation Regression/Multiple regression Logistic regression Factor analysis

More information

Module 5: Multiple Regression Analysis

Module 5: Multiple Regression Analysis Using Statistical Data Using to Make Statistical Decisions: Data Multiple to Make Regression Decisions Analysis Page 1 Module 5: Multiple Regression Analysis Tom Ilvento, University of Delaware, College

More information

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1) CORRELATION AND REGRESSION / 47 CHAPTER EIGHT CORRELATION AND REGRESSION Correlation and regression are statistical methods that are commonly used in the medical literature to compare two or more variables.

More information

The Dummy s Guide to Data Analysis Using SPSS

The Dummy s Guide to Data Analysis Using SPSS The Dummy s Guide to Data Analysis Using SPSS Mathematics 57 Scripps College Amy Gamble April, 2001 Amy Gamble 4/30/01 All Rights Rerserved TABLE OF CONTENTS PAGE Helpful Hints for All Tests...1 Tests

More information

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Descriptive Statistics Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion Statistics as a Tool for LIS Research Importance of statistics in research

More information

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics. Business Course Text Bowerman, Bruce L., Richard T. O'Connell, J. B. Orris, and Dawn C. Porter. Essentials of Business, 2nd edition, McGraw-Hill/Irwin, 2008, ISBN: 978-0-07-331988-9. Required Computing

More information

When to Use a Particular Statistical Test

When to Use a Particular Statistical Test When to Use a Particular Statistical Test Central Tendency Univariate Descriptive Mode the most commonly occurring value 6 people with ages 21, 22, 21, 23, 19, 21 - mode = 21 Median the center value the

More information

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY

Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY Biostatistics: DESCRIPTIVE STATISTICS: 2, VARIABILITY 1. Introduction Besides arriving at an appropriate expression of an average or consensus value for observations of a population, it is important to

More information

Organizing Your Approach to a Data Analysis

Organizing Your Approach to a Data Analysis Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize

More information

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013 Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives

More information

Chapter 5 Analysis of variance SPSS Analysis of variance

Chapter 5 Analysis of variance SPSS Analysis of variance Chapter 5 Analysis of variance SPSS Analysis of variance Data file used: gss.sav How to get there: Analyze Compare Means One-way ANOVA To test the null hypothesis that several population means are equal,

More information

Canonical Correlation Analysis

Canonical Correlation Analysis Canonical Correlation Analysis LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the similarities and differences between multiple regression, factor analysis,

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS

QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS QUANTITATIVE METHODS BIOLOGY FINAL HONOUR SCHOOL NON-PARAMETRIC TESTS This booklet contains lecture notes for the nonparametric work in the QM course. This booklet may be online at http://users.ox.ac.uk/~grafen/qmnotes/index.html.

More information

Descriptive Statistics

Descriptive Statistics Y520 Robert S Michael Goal: Learn to calculate indicators and construct graphs that summarize and describe a large quantity of values. Using the textbook readings and other resources listed on the web

More information

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data.

Chapter 15. Mixed Models. 15.1 Overview. A flexible approach to correlated data. Chapter 15 Mixed Models A flexible approach to correlated data. 15.1 Overview Correlated data arise frequently in statistical analyses. This may be due to grouping of subjects, e.g., students within classrooms,

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

Premaster Statistics Tutorial 4 Full solutions

Premaster Statistics Tutorial 4 Full solutions Premaster Statistics Tutorial 4 Full solutions Regression analysis Q1 (based on Doane & Seward, 4/E, 12.7) a. Interpret the slope of the fitted regression = 125,000 + 150. b. What is the prediction for

More information

Multivariate Analysis of Variance (MANOVA)

Multivariate Analysis of Variance (MANOVA) Chapter 415 Multivariate Analysis of Variance (MANOVA) Introduction Multivariate analysis of variance (MANOVA) is an extension of common analysis of variance (ANOVA). In ANOVA, differences among various

More information

Principles of Hypothesis Testing for Public Health

Principles of Hypothesis Testing for Public Health Principles of Hypothesis Testing for Public Health Laura Lee Johnson, Ph.D. Statistician National Center for Complementary and Alternative Medicine johnslau@mail.nih.gov Fall 2011 Answers to Questions

More information

Projects Involving Statistics (& SPSS)

Projects Involving Statistics (& SPSS) Projects Involving Statistics (& SPSS) Academic Skills Advice Starting a project which involves using statistics can feel confusing as there seems to be many different things you can do (charts, graphs,

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the

More information

UNDERSTANDING THE TWO-WAY ANOVA

UNDERSTANDING THE TWO-WAY ANOVA UNDERSTANDING THE e have seen how the one-way ANOVA can be used to compare two or more sample means in studies involving a single independent variable. This can be extended to two independent variables

More information

NCSS Statistical Software

NCSS Statistical Software Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, two-sample t-tests, the z-test, the

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests

Logistic Regression. http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Logistic Regression http://faculty.chass.ncsu.edu/garson/pa765/logistic.htm#sigtests Overview Binary (or binomial) logistic regression is a form of regression which is used when the dependent is a dichotomy

More information

Multiple Regression: What Is It?

Multiple Regression: What Is It? Multiple Regression Multiple Regression: What Is It? Multiple regression is a collection of techniques in which there are multiple predictors of varying kinds and a single outcome We are interested in

More information

January 26, 2009 The Faculty Center for Teaching and Learning

January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS A USER GUIDE January 26, 2009 The Faculty Center for Teaching and Learning THE BASICS OF DATA MANAGEMENT AND ANALYSIS Table of Contents Table of Contents... i

More information

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number 1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number A. 3(x - x) B. x 3 x C. 3x - x D. x - 3x 2) Write the following as an algebraic expression

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

More information

Moderation. Moderation

Moderation. Moderation Stats - Moderation Moderation A moderator is a variable that specifies conditions under which a given predictor is related to an outcome. The moderator explains when a DV and IV are related. Moderation

More information

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012 Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

More information

Descriptive Analysis

Descriptive Analysis Research Methods William G. Zikmund Basic Data Analysis: Descriptive Statistics Descriptive Analysis The transformation of raw data into a form that will make them easy to understand and interpret; rearranging,

More information

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics Course Text Business Statistics Lind, Douglas A., Marchal, William A. and Samuel A. Wathen. Basic Statistics for Business and Economics, 7th edition, McGraw-Hill/Irwin, 2010, ISBN: 9780077384470 [This

More information

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name:

Good luck! BUSINESS STATISTICS FINAL EXAM INSTRUCTIONS. Name: Glo bal Leadership M BA BUSINESS STATISTICS FINAL EXAM Name: INSTRUCTIONS 1. Do not open this exam until instructed to do so. 2. Be sure to fill in your name before starting the exam. 3. You have two hours

More information

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test

Bivariate Statistics Session 2: Measuring Associations Chi-Square Test Bivariate Statistics Session 2: Measuring Associations Chi-Square Test Features Of The Chi-Square Statistic The chi-square test is non-parametric. That is, it makes no assumptions about the distribution

More information

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection

Stepwise Regression. Chapter 311. Introduction. Variable Selection Procedures. Forward (Step-Up) Selection Chapter 311 Introduction Often, theory and experience give only general direction as to which of a pool of candidate variables (including transformed variables) should be included in the regression model.

More information

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model Assumptions Assumptions of linear models Apply to response variable within each group if predictor categorical Apply to error terms from linear model check by analysing residuals Normality Homogeneity

More information

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS MSR = Mean Regression Sum of Squares MSE = Mean Squared Error RSS = Regression Sum of Squares SSE = Sum of Squared Errors/Residuals α = Level of Significance

More information

Simple Predictive Analytics Curtis Seare

Simple Predictive Analytics Curtis Seare Using Excel to Solve Business Problems: Simple Predictive Analytics Curtis Seare Copyright: Vault Analytics July 2010 Contents Section I: Background Information Why use Predictive Analytics? How to use

More information

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS SEEMA JAGGI Indian Agricultural Statistics Research Institute Library Avenue, New Delhi - 110 012 seema@iasri.res.in 1. Descriptive Statistics Statistics

More information

How To Write A Data Analysis

How To Write A Data Analysis Mathematics Probability and Statistics Curriculum Guide Revised 2010 This page is intentionally left blank. Introduction The Mathematics Curriculum Guide serves as a guide for teachers when planning instruction

More information

Multivariate Analysis. Overview

Multivariate Analysis. Overview Multivariate Analysis Overview Introduction Multivariate thinking Body of thought processes that illuminate the interrelatedness between and within sets of variables. The essence of multivariate thinking

More information

Foundation of Quantitative Data Analysis

Foundation of Quantitative Data Analysis Foundation of Quantitative Data Analysis Part 1: Data manipulation and descriptive statistics with SPSS/Excel HSRS #10 - October 17, 2013 Reference : A. Aczel, Complete Business Statistics. Chapters 1

More information

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing.

Introduction to. Hypothesis Testing CHAPTER LEARNING OBJECTIVES. 1 Identify the four steps of hypothesis testing. Introduction to Hypothesis Testing CHAPTER 8 LEARNING OBJECTIVES After reading this chapter, you should be able to: 1 Identify the four steps of hypothesis testing. 2 Define null hypothesis, alternative

More information

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING

More information

The Statistics Tutor s Quick Guide to

The Statistics Tutor s Quick Guide to statstutor community project encouraging academics to share statistics support resources All stcp resources are released under a Creative Commons licence The Statistics Tutor s Quick Guide to Stcp-marshallowen-7

More information

The Chi-Square Test. STAT E-50 Introduction to Statistics

The Chi-Square Test. STAT E-50 Introduction to Statistics STAT -50 Introduction to Statistics The Chi-Square Test The Chi-square test is a nonparametric test that is used to compare experimental results with theoretical models. That is, we will be comparing observed

More information

How To Check For Differences In The One Way Anova

How To Check For Differences In The One Way Anova MINITAB ASSISTANT WHITE PAPER This paper explains the research conducted by Minitab statisticians to develop the methods and data checks used in the Assistant in Minitab 17 Statistical Software. One-Way

More information

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could

More information

DISCRIMINANT FUNCTION ANALYSIS (DA)

DISCRIMINANT FUNCTION ANALYSIS (DA) DISCRIMINANT FUNCTION ANALYSIS (DA) John Poulsen and Aaron French Key words: assumptions, further reading, computations, standardized coefficents, structure matrix, tests of signficance Introduction Discriminant

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

THE KRUSKAL WALLLIS TEST

THE KRUSKAL WALLLIS TEST THE KRUSKAL WALLLIS TEST TEODORA H. MEHOTCHEVA Wednesday, 23 rd April 08 THE KRUSKAL-WALLIS TEST: The non-parametric alternative to ANOVA: testing for difference between several independent groups 2 NON

More information

Examining Differences (Comparing Groups) using SPSS Inferential statistics (Part I) Dwayne Devonish

Examining Differences (Comparing Groups) using SPSS Inferential statistics (Part I) Dwayne Devonish Examining Differences (Comparing Groups) using SPSS Inferential statistics (Part I) Dwayne Devonish Statistics Statistics are quantitative methods of describing, analysing, and drawing inferences (conclusions)

More information

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics

Analysis of Data. Organizing Data Files in SPSS. Descriptive Statistics Analysis of Data Claudia J. Stanny PSY 67 Research Design Organizing Data Files in SPSS All data for one subject entered on the same line Identification data Between-subjects manipulations: variable to

More information

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate 1 One-Way ANOVA using SPSS 11.0 This section covers steps for testing the difference between three or more group means using the SPSS ANOVA procedures found in the Compare Means analyses. Specifically,

More information

Research Methods & Experimental Design

Research Methods & Experimental Design Research Methods & Experimental Design 16.422 Human Supervisory Control April 2004 Research Methods Qualitative vs. quantitative Understanding the relationship between objectives (research question) and

More information

UNIVERSITY OF NAIROBI

UNIVERSITY OF NAIROBI UNIVERSITY OF NAIROBI MASTERS IN PROJECT PLANNING AND MANAGEMENT NAME: SARU CAROLYNN ELIZABETH REGISTRATION NO: L50/61646/2013 COURSE CODE: LDP 603 COURSE TITLE: RESEARCH METHODS LECTURER: GAKUU CHRISTOPHER

More information

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS CHAPTER 7B Multiple Regression: Statistical Methods Using IBM SPSS This chapter will demonstrate how to perform multiple linear regression with IBM SPSS first using the standard method and then using the

More information

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS

Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS Overview of Non-Parametric Statistics PRESENTER: ELAINE EISENBEISZ OWNER AND PRINCIPAL, OMEGA STATISTICS About Omega Statistics Private practice consultancy based in Southern California, Medical and Clinical

More information

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Lecture 2: Descriptive Statistics and Exploratory Data Analysis Lecture 2: Descriptive Statistics and Exploratory Data Analysis Further Thoughts on Experimental Design 16 Individuals (8 each from two populations) with replicates Pop 1 Pop 2 Randomly sample 4 individuals

More information

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September

More information

Multinomial and Ordinal Logistic Regression

Multinomial and Ordinal Logistic Regression Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,

More information

Instructions for SPSS 21

Instructions for SPSS 21 1 Instructions for SPSS 21 1 Introduction... 2 1.1 Opening the SPSS program... 2 1.2 General... 2 2 Data inputting and processing... 2 2.1 Manual input and data processing... 2 2.2 Saving data... 3 2.3

More information

Basic Concepts in Research and Data Analysis

Basic Concepts in Research and Data Analysis Basic Concepts in Research and Data Analysis Introduction: A Common Language for Researchers...2 Steps to Follow When Conducting Research...3 The Research Question... 3 The Hypothesis... 4 Defining the

More information

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest

Analyzing Intervention Effects: Multilevel & Other Approaches. Simplest Intervention Design. Better Design: Have Pretest Analyzing Intervention Effects: Multilevel & Other Approaches Joop Hox Methodology & Statistics, Utrecht Simplest Intervention Design R X Y E Random assignment Experimental + Control group Analysis: t

More information