Data Analysis: Analyzing Data - Inferential Statistics



Similar documents
When to Use a Particular Statistical Test

Statistics Review PSY379

II. DISTRIBUTIONS distribution normal distribution. standard scores

Description. Textbook. Grading. Objective

Introduction to Statistics and Quantitative Research Methods

Recall this chart that showed how most of our course would be organized:

Module 3: Correlation and Covariance

Executive Summary. 1. What is the temporal relationship between problem gambling and other co-occurring disorders?

Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus

Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)

Chi Square Tests. Chapter Introduction

Simple Linear Regression Inference

Additional sources Compilation of sources:

A Primer on Forecasting Business Performance

Discussion of State Appropriation to Eastern Michigan University August 2013

Biostatistics: Types of Data Analysis

CHAPTER 15 NOMINAL MEASURES OF CORRELATION: PHI, THE CONTINGENCY COEFFICIENT, AND CRAMER'S V

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

430 Statistics and Financial Mathematics for Business

Logic Models, Human Service Programs, and Performance Measurement

Introduction to Longitudinal Data Analysis

UNDERSTANDING THE TWO-WAY ANOVA

Association Between Variables

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Inferential Statistics. What are they? When would you use them?

IPDET Module 6: Descriptive, Normative, and Impact Evaluation Designs

Week 1. Exploratory Data Analysis

Credit Risk Models. August 24 26, 2010

Data, Measurements, Features

Lecture 1: Review and Exploratory Data Analysis (EDA)

ASC 076 INTRODUCTION TO SOCIAL AND CRIMINAL PSYCHOLOGY


Change Management Dashboard: An Adaptive Approach to Lead a Change Program

International Statistical Institute, 56th Session, 2007: Phil Everson

Parametric and Nonparametric: Demystifying the Terms

Data Analysis Tools. Tools for Summarizing Data

BASIC COURSE IN STATISTICS FOR MEDICAL DOCTORS, SCIENTISTS & OTHER RESEARCH INVESTIGATORS INFORMATION BROCHURE

Statistics, Research, & SPSS: The Basics

Analyzing Research Data Using Excel

Organizing Your Approach to a Data Analysis

IBM SPSS Direct Marketing 23

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

IBM SPSS Direct Marketing 22

Profile of Rural Health Insurance Coverage

Chi Square Distribution

Descriptive Statistics

Lecture 2. Summarizing the Sample

Introduction to Quantitative Methods

10. Analysis of Longitudinal Studies Repeat-measures analysis

Working with data: Data analyses April 8, 2014

Overview Classes Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Basic Concepts in Research and Data Analysis

PELLISSIPPI STATE COMMUNITY COLLEGE MASTER SYLLABUS INTRODUCTION TO STATISTICS MATH 2050

STATISTICAL ANALYSIS WITH EXCEL COURSE OUTLINE

Multivariate Analysis. Overview

Descriptive Statistics and Measurement Scales

(Following Paper ID and Roll No. to be filled in your Answer Book) Roll No. METHODOLOGY

REGULATIONS FOR THE POSTGRADUATE DIPLOMA IN CLINICAL RESEARCH METHODOLOGY (PDipClinResMethodology)

DATA COLLECTION AND ANALYSIS

UNIVERSITY of MASSACHUSETTS DARTMOUTH Charlton College of Business Decision and Information Sciences Fall 2010

Criminal Justice Professionals Attitudes Towards Offenders: Assessing the Link between Global Orientations and Specific Attributions

Retirement routes and economic incentives to retire: a cross-country estimation approach Martin Rasmussen

MULTIPLE REGRESSION AND ISSUES IN REGRESSION ANALYSIS

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

MSSW Application Requirements

Module 5: Multiple Regression Analysis

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Comparing Alternate Designs For A Multi-Domain Cluster Sample

We are often interested in the relationship between two variables. Do people with more years of full-time education earn higher salaries?

SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES

Salaries of HIM Professionals

SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing lsun@smu.edu.sg

Simple Predictive Analytics Curtis Seare

Data Analysis, Research Study Design and the IRB

Demand forecasting & Aggregate planning in a Supply chain. Session Speaker Prof.P.S.Satish

Unit 1: Introduction to Quality Management

Chapter 5 Analysis of variance SPSS Analysis of variance

Survey Research Data Analysis

UNIVERSITY OF NAIROBI

Senate Bill No. 2 CHAPTER 673

What Does a Personality Look Like?

Gerry Hobbs, Department of Statistics, West Virginia University

CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE

MSW Application Requirements. General Requirements...2. Additional Requirements by Program 4. Admissions Essays...5. International Applicants 7

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

Business Statistics MBA Course Outline

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Handling attrition and non-response in longitudinal data

16 : Demand Forecasting

Multivariate Normal Distribution

MULTIPLE REGRESSION WITH CATEGORICAL DATA

Transcription:

WHAT IT IS Return to Table of ontents WHEN TO USE IT Inferential statistics deal with drawing conclusions and, in some cases, making predictions about the properties of a population based on information obtained from a sample. While descriptive statistics provide information about the central tendency, dispersion, skew, and kurtosis of data, inferential statistics allow making broader statements about the relationships between data. Inferential statistics are frequently used to answer cause-and-effect questions and make predictions. They are also used to investigate differences between and among groups. However, one must understand that inferential statistics by themselves do not prove causality. Such proof is always a function of a given theory, and it is vital that such theory be clearly stated prior to using inferential statistics. Otherwise, their use is little more than a fishing expedition. For example, suppose that statistical methods suggest that on average, men are paid significantly more than women for full-time work. Several competing explanations may exist for this discrepancy. Inferential statistics can provide evidence to prove one theory more accurate than the other. However, any ultimate conclusions about actual causality must come from a theory supported by both the data and sound logic. HOW TO PREPARE IT The following briefly introduces some common techniques of inferential statistics and is intended as a guide for determining when certain techniques may be appropriate. The techniques used generally depend on the kinds of variables involved, i.e. nominal, ordinal, or interval. For further information on and/or assistance with a given technique, refer to the books and in-house support listed in the Resources section at the end of this module. hi-square (P ) tests are used to identify differences between groups when all variables nominal, e.g., gender, ethnicity, salary group, political party affiliation, and so forth. Such tests are normally used with contingency tables which group observations based on common characteristics. For example, suppose one wants to determine if political party affiliation differs among ethnic groups. A contingency table which divides the sample into political parties and ethnic groups could be produced. A P test would tell if the ethnic distribution of the sample indicates differences in party affiliation. As a rule, each cell in a contingency table should have a least five observations. In those cases where this is not possible, the Fisher s Exact Test should replace the P test. Data analysis software will usually warn the user when a cell contains fewer than five observations. Analysis of variance (ANOVA) permits comparison of two or more populations when interval variables are used. ANOVA does this by comparing the dispersion of samples in order to make inferences about their means. ANOVA seeks to answer two basic questions: Texas State Auditor's Office, Methodology Manual, rev. 5/95-1

Accountability Modules Are the means of variables of interest different in different populations? Are the differences in the mean values statistically significant? For example, during the Welfare Reform audit, SAO staff wanted to test whether the incomes of job training program participants and nonparticipants were significantly different. ANOVA was used to do this. Analysis of covariance (AOVA) examines whether or not interval variables move together in ways that are independent of their mean values. Ideally, variables should move independently of one another, regardless of their means. Unfortunately, in the real world, groups of observations usually differ on a number of dimensions, making simple analyses of variance tests problematic since differences in other characteristics could cause observed differences in the values of the variables of interest. For example, suppose auditors/evaluators are comparing the income of job program participants and non-participants. Assuming job program participants were found to have higher incomes, could one conclude the explanation was their participation in the program? Although this may be the case, one cannot yet draw that conclusion. Other distinguishing characteristics of program participants might explain higher income. In order to control for the effects of those other variables, an analysis of covariance is necessary. During the Welfare Reform audit noted above, SAO staff found that the incomes of job training program participants and non-participants differed by gender and by target group classification (federal classification for welfare intervention purposes). An AOVA was done to determine if the wages of program participants and nonparticipants were still significantly different, after controlling for differences in gender and target group classification. Thus, here gender and target group were considered the covariates. orrelation (D), like AOVA, is used to measure the similarity in the changes of values of interval variables but is not influenced by the units of measure. Another advantage of correlation is that it is always bounded by the interval: -1 # D # 1 Here -1 indicates a perfect inverse linear relationship, i.e. y increases uniformly as x decreases, and 1 indicates a perfect direct linear relationship, i.e. x and y move uniformly together. A value of 0 indicates no relationship. Note that correlation can determine that a - Texas State Auditor's Office, Methodology Manual, rev. 5/95

relationship exists between variables but says nothing about the cause or directional effect. For example, a known correlation exists between muggings and ice cream sales. However, one does not cause the other. Rather, a third variable, the warm weather which puts more people on the street both to mug and buy ice cream is a more direct cause of the correlation. As a rule, it is wise to examine the correlations between all variables in a data set. This both warns auditors/evaluators about possible covariations and suggests areas for possible follow-up investigation. Regression analysis is often used to determine the effect of independent variables on a dependent variable. Regression measures the relative impact of each independent variable and is useful in forecasting. It is used most appropriately when both the independent and dependent variables are interval, though some social scientists also use regression on ordinal data. Like correlation, regression analysis assumes that the relationship between variables is linear. Regression analysis permits including multiple independent variables. For example, during the Welfare Reform audit, SAO staff wanted to predict the percentage of JOBS program clients in a given county who had gotten a job. The poverty rate and population density of the county were used as independent variables. These two variables enabled the auditors to predict almost perfectly the percentage of people who got a job. Logistic regression analysis is used to examine relationships between variables when the dependent variable is nominal, even though independent variables are nominal, ordinal, interval, or some mixture thereof. Suppose that one wanted to determine which program interventions were associated with a JOBS Program client's ability to get a job within six months of exiting the program. The outcome variable would be "job" or "no job, clearly a nominal variable. One could then use several independent variables such as GED completion, job training, post-secondary education and the like to predict the odds of getting a job. Such a method was applied to the JOBS Program audit. Discriminant analysis is similar to logistic regression in that the outcome variable is categorical. However, here the independent variables must be interval. In the audit of the Probation System, SAO staff explored how well a probationer's rating on drug abuse severity, social adjustment, and similar characteristics predicted whether or not the probationer committed another crime using continuous ratings. The predictive power of these ratings was just slightly better than chance. Of special interest was the percentage of times the model predicted the Texas State Auditor's Office, Methodology Manual, rev. 5/95-3

Accountability Modules probationer would not commit another crime when he or she actually did. Since that percentage was quite high, the model was a poor one. Factor analysis simultaneously examines multiple variables to determine if they reflect larger underlying dimensions. Factor analysis is commonly used when analyzing data from multi-question surveys to reduce the numerous questions to a smaller set of more global issues. For example, during the Department of Information Resources (DIR) audit, SAO staff administered a survey to agencies that used DIR services. The survey had approximately 30 questions, each of which contained three sub-parts, for a total of 90 objective questions. Reporting the responses to all 90 questions was time-consuming and risked losing continuity. Through factor analysis, the audit team found that the survey could be reduced to five central underlying questions. Thus, in many ways, factor analysis is analogous to the process of "rolling up" audit issues. Forecasting exists in many variations. The predictive power of regression analysis can be an effective forecasting tool, but time series forecasting is more common when time is a significant independent variable. Time series forecasting is based on four components: trends, as indicated by a relatively smooth pattern in the data over a long period of time, e.g., population increase cyclical effects, as indicated by a wave-like pattern in the data moving above and below a trend line of over one year in duration, e.g., recessions seasonal effects, as indicated by changes similar to cyclical effects but over a period of less than one year, e.g., expenditures for sand to apply to roads in icy conditions random variations, as indicated by any change in the data not attributable to any other component. Time series forecasting models can be expressed as an additive or multiplicative function of these four components. Once measurements of the four components exist, they can be fit to a regression line to predict future values. ADVANTAGES Inferential statistics can: provide more detailed information than descriptive statistics yield insight into relationships between variables reveal causes and effects and make predictions generate convincing support for a given theory be generally accepted due to widespread use in business and academia - 4 Texas State Auditor's Office, Methodology Manual, rev. 5/95

DISADVANTAGES Inferential statistics can: be quite difficult to learn and use properly be vulnerable to misuse and abuse depend more on sound theory than on implications of a data set Texas State Auditor's Office, Methodology Manual, rev. 5/95-5