SPSS TRAINING SESSION 3: ADVANCED TOPICS (PASW STATISTICS 17.0)
Sun Li, Centre for Academic Computing
lsun@smu.edu.sg




IN SPSS SESSION 2, WE HAVE LEARNT:
- Elementary Data Analysis
- Group Comparison & One-way ANOVA
- Non-parametric Tests
- Correlations
- General Linear Regression: GLM Univariate procedure
- Logistic Models: Binary Logistic Model, Multinomial Logistic Model, Ordinal Logistic Model

OUTLINE
- Curve Estimation
- Non-linear Regression
- Missing Value Analysis & Multiple Imputation
- Survival Analysis
- Mixed Linear Model
- Time Series Analysis & Forecasting (self-study)

CURVE ESTIMATION
Using Curve Estimation to model the law of diminishing returns when the relationship between the dependent variable(s) and the independent variable is not necessarily linear.
Menu: Analyze > Regression > Curve Estimation
Example: advert.sav
A retailer wants to examine the relationship between money spent on advertising and the resulting sales. To this end, they have collected past sales figures and the associated advertising costs.


CURVE ESTIMATION
Use Chart Builder to plot diagnostic graphs.
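The idea behind Curve Estimation can be sketched outside SPSS: fit several candidate curves to the same data and compare their fit. The Python sketch below is only an illustration, not what SPSS runs internally. It compares a linear model with a logarithmic model using R squared. The data values and the helper name fit_line are hypothetical and are not taken from advert.sav.

```python
import math

# Hypothetical (advertising, sales) pairs; NOT the contents of advert.sav.
advertising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sales = [6.0, 8.1, 9.2, 10.1, 10.7, 11.2]


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return b0, b1, 1.0 - ss_res / ss_tot


# Linear model: sales = b0 + b1 * advertising
linear = fit_line(advertising, sales)
# Logarithmic model: sales = b0 + b1 * ln(advertising)
logarithmic = fit_line([math.log(v) for v in advertising], sales)

print("linear R^2:     ", round(linear[2], 4))
print("logarithmic R^2:", round(logarithmic[2], 4))
```

With data that flatten out as advertising grows, the logarithmic curve would be expected to fit better than the straight line, which is the pattern the diminishing-returns example is meant to show.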

NON-LINEAR REGRESSION
Using Nonlinear Regression to model the law of diminishing returns when the relationship between the dependent and independent variables is not intrinsically linear.
Example: advert.sav
Menu: Analyze > Regression > Nonlinear

NON-LINEAR REGRESSION
The asymptotic regression model:
$$y = b_1 + b_2 \exp(b_3 x)$$
When b1 > 0, b2 < 0, and b3 < 0, it gives Mitscherlich's model of the "law of diminishing returns". This model initially increases quickly with increasing values of x, but then the gains slow and finally taper off just below the value b1.
Starting values:
- b1 represents the upper asymptote for sales. Looking at the chart, even the largest sales values fall just short of 13, so that is a reasonable starting value.
- b2 is the difference between the value of y when x = 0 and the upper asymptote. A reasonable starting value is the minimum value of y minus b1. Looking at the chart, that is about 7 - 13 = -6.
- b3 can be roughly estimated by the negative of the slope between two "well separated" points on the plot. Looking at the chart, there are points at about (x = 2, y = 8) and (x = 5, y = 12). The slope between these points is (12 - 8)/(5 - 2) = 1.33, so a rough initial estimate for b3 is -1.33.
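As a rough illustration of fitting y = b1 + b2 exp(b3 x), the Python sketch below uses a deliberately simple substitute technique rather than the iterative algorithm SPSS uses. For any fixed b3 the model is linear in b1 and b2, so the sketch tries a grid of negative b3 values, solves the linear least-squares problem for each, and keeps the combination with the smallest sum of squared errors. The data and the function name fit_given_b3 are hypothetical.

```python
import math

# Hypothetical data shaped like a diminishing-returns curve; NOT advert.sav.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [7.1, 9.3, 10.8, 11.7, 12.1, 12.5, 12.7, 12.8, 12.9]


def fit_given_b3(x, y, b3):
    """For fixed b3, regress y on z = exp(b3 * x); returns (b1, b2, sse)."""
    z = [math.exp(b3 * xi) for xi in x]
    n = len(x)
    mean_z = sum(z) / n
    mean_y = sum(y) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    szy = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y))
    b2 = szy / szz
    b1 = mean_y - b2 * mean_z
    sse = sum((yi - (b1 + b2 * zi)) ** 2 for zi, yi in zip(z, y))
    return b1, b2, sse


# Candidate b3 values from -0.01 down to -3.00 in steps of 0.01.
candidates = [-k / 100 for k in range(1, 301)]
best = min(
    ((b3,) + fit_given_b3(x, y, b3) for b3 in candidates),
    key=lambda result: result[3],  # smallest sum of squared errors
)
b3_hat, b1_hat, b2_hat, sse = best

print("b1 (upper asymptote):", round(b1_hat, 3))
print("b2:", round(b2_hat, 3))
print("b3:", round(b3_hat, 2))
```

In SPSS, the starting values read from the chart play the role that the grid plays here: they put the iterative search in the right region.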


MISSING VALUE ANALYSIS (MVA)
Data imputation for missing values using the MVA procedure
Types of missingness:
- Missing completely at random (MCAR): exists when missing values are randomly distributed across all cases. The SPSS MVA procedure supports Little's MCAR test.
- Missing at random (MAR): exists when missing values are not randomly distributed across all cases but are randomly distributed within one or more subsamples. SPSS MVA generates a table of Separate Variance t Tests. A significant result in any cell means that missingness in the row variable is significantly related to the column variable, so the data are not MCAR (they may still be MAR).
- Non-ignorable missingness: exists when missing values are not randomly distributed across cases, but the probability of missingness cannot be predicted from the variables in the model. It is the most problematic form.
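The logic of a separate-variance t test can be sketched in a few lines of Python. The example splits cases by whether one variable (income) is missing and compares the mean of another variable (age) between the two groups, without assuming equal variances. The records and the function name welch_t are hypothetical, and the sketch computes only the t statistic, not the degrees of freedom or p-value that SPSS reports.

```python
import statistics

# Hypothetical records: (income, age); None marks a missing income.
records = [
    (None, 61), (42.0, 35), (None, 58), (55.0, 41), (38.0, 29),
    (None, 66), (47.0, 38), (51.0, 44), (36.0, 31), (None, 59),
]

age_when_income_missing = [age for income, age in records if income is None]
age_when_income_present = [age for income, age in records if income is not None]


def welch_t(a, b):
    """Separate-variance t statistic for the difference in means of a and b."""
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


t_stat = welch_t(age_when_income_missing, age_when_income_present)
print("t statistic:", round(t_stat, 2))
```

A large absolute t value suggests that whether income is missing depends on age, which is evidence against the values being missing purely by chance.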

MISSING VALUE ANALYSIS
Estimation methods:
- Pattern analysis: describes the pattern of missing data.
- Listwise / pairwise deletion: both deletion methods assume MCAR.
- Mean substitution: was once popular but is no longer preferred.
- Multiple regression: uses non-missing data to predict the values of missing data.
- Maximum likelihood estimation (MLE): makes few demands of the data in terms of statistical assumptions and is generally considered superior to imputation by multiple regression. This is now the most common method of imputation.
- Approximate Bayesian bootstrap (ABB): uses logistic regression to estimate the probability of response/non-response based on covariates. SPSS does not yet support ABB.
- Multiple imputation: generates multiple simulated values for each incomplete datum, then iteratively analyzes the datasets with each simulated value substituted in turn.
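One reason mean substitution fell out of favour can be shown with a small Python sketch: replacing every missing value with the observed mean leaves the mean unchanged but understates the spread of the variable. The income values below are hypothetical.

```python
import statistics

# Hypothetical incomes (in thousands); None marks a missing value.
income = [31.0, 45.0, None, 52.0, 38.0, None, 61.0, 47.0, None, 40.0]

observed = [v for v in income if v is not None]
mean_observed = statistics.mean(observed)

# Mean substitution: every missing value becomes the observed mean.
imputed = [mean_observed if v is None else v for v in income]

sd_observed = statistics.stdev(observed)
sd_imputed = statistics.stdev(imputed)

print("mean (observed vs imputed):", round(mean_observed, 2), round(statistics.mean(imputed), 2))
print("std dev (observed vs imputed):", round(sd_observed, 2), round(sd_imputed, 2))
```

The artificially small standard deviation carries through to standard errors and correlations, which is why methods that preserve variability, such as multiple imputation, are usually preferred.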

MISSING VALUE ANALYSIS
When to use MVA:
- As a rule of thumb, if a variable has more than 5% missing values, cases are not deleted, and many researchers are much more stringent than this.
- Imputation is not recommended for multivariate analysis, as it can distort the coefficients of association and correlation relating the variables.
- If researchers are still not sure whether to apply MVA, it is recommended to run all analyses on both the original and the imputed datasets, and to discuss where imputation would make a difference for the substantive (not merely statistical) interpretations.

MISSING VALUE ANALYSIS
Using SPSS Multiple Imputation of Missing Values:
Step 1: Describe the pattern of missing data
Menu: Analyze > Missing Value Analysis
Example: telco_missing.sav
A telecommunications provider wants to better understand service usage patterns in its customer database. The company wants to ensure that the data are missing completely at random before running further analyses.


MULTIPLE IMPUTATION
Step 2: Multiple imputation
Menu: Analyze > Multiple Imputation > Analyze Patterns


MULTIPLE IMPUTATION
Menu: Analyze > Multiple Imputation > Impute Missing Data Values


MULTIPLE IMPUTATION
Step 3: Run the analysis using the completed (imputed) data, e.g. a multinomial logistic model with dependent variable custcat (customer categories).
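When the same model is fitted to each imputed dataset, the separate results are usually combined with Rubin's rules. The Python sketch below shows that arithmetic for a single coefficient. The estimates and variances are hypothetical numbers, not output from telco_missing.sav, and the slides do not describe how SPSS pools its results.

```python
import statistics

# Hypothetical results for one coefficient from m = 5 imputed datasets.
estimates = [0.42, 0.47, 0.39, 0.45, 0.44]       # coefficient estimates
variances = [0.010, 0.011, 0.009, 0.010, 0.012]  # squared standard errors

m = len(estimates)
pooled = statistics.mean(estimates)        # pooled point estimate
within = statistics.mean(variances)        # average within-imputation variance
between = statistics.variance(estimates)   # between-imputation variance
total = within + (1 + 1 / m) * between     # total variance under Rubin's rules
pooled_se = total ** 0.5

print("pooled estimate:", round(pooled, 3))
print("pooled standard error:", round(pooled_se, 3))
```

The pooled standard error is larger than the typical single-dataset standard error because it also reflects the uncertainty caused by the missing values.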

SURVIVAL ANALYSIS
Survival Data
Data: survival data is time-to-event data. It is quantitative data corresponding to the time from a well-defined time origin until the occurrence of some particular event of interest or endpoint.
Reasons for using a survival model:
- The distribution of survival data tends to be positively skewed and is unlikely to be normal, and it may not be possible to find a suitable transformation.
- Time-varying covariates cannot be handled by standard methods.
- Some durations are censored.
Censored observations: the event has not occurred by the end of the study; the subject is lost to follow-up; the subject withdraws from the study; other interventions are offered; the event occurred but for an unrelated cause; etc.

SURVIVAL ANALYSIS
Survival Model
Survival function:
$$S(t) = P(T > t) = 1 - F(t)$$
Hazard function:
$$h(t) = \frac{f(t)}{S(t)}, \qquad h(t) = -\frac{d \log S(t)}{dt}, \qquad S(t) = \exp(-H(t))$$
H(t) is the cumulative hazard function.
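A short worked case may help connect these quantities. Assume, purely for illustration, a constant hazard $\lambda > 0$ (the symbol $\lambda$ is not used in the slides):

$$h(t) = \lambda \;\Rightarrow\; H(t) = \int_0^t \lambda \, du = \lambda t \;\Rightarrow\; S(t) = \exp(-\lambda t)$$

$$f(t) = h(t)\, S(t) = \lambda \exp(-\lambda t), \qquad F(t) = 1 - S(t) = 1 - \exp(-\lambda t)$$

This is the exponential distribution: a constant hazard means the chance of the event in the next instant does not depend on how long the subject has already survived.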

SURVIVAL ANALYSIS
Kaplan-Meier estimator:
$$\hat{S}(t) = \prod_{j:\, t_{(j)} \le t} \left(1 - \frac{d_j}{n_j}\right), \qquad t_{(1)} < t_{(2)} < \dots < t_{(n)}$$
- d_j: the number of individuals who experience the event at time t_(j)
- n_j: the number of individuals who have not yet experienced the event at time t_(j) (the number at risk)
Cox regression:
$$h_i(t) = h_0(t) \exp\left(\beta^T x_i(t)\right)$$
$$\log \frac{h_i(t)}{h_0(t)} = \beta^T x_i(t)$$
For covariates that do not change over time:
$$S_i(t) = S_0(t)^{\exp(\beta^T x_i)}$$
h_0(t) is the baseline hazard function.
$\exp(\beta^T (x_i - x_j))$ is the hazard ratio (HR) or incidence rate ratio.
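The Kaplan-Meier product can be computed by hand for a small sample, which the following Python sketch does. It is a minimal illustration of the estimator rather than a replacement for the SPSS procedure. The tenure and churn values are hypothetical (not from telco.sav), and kaplan_meier is a name chosen here.

```python
def kaplan_meier(times, events):
    """Return [(t_j, S_hat(t_j))] for each distinct time with at least one event.

    times  -- observed durations
    events -- 1 if the event occurred at that time, 0 if the case is censored
    """
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        d = 0        # number of events at time t
        removed = 0  # events plus censored cases leaving the risk set at t
        while i < len(pairs) and pairs[i][0] == t:
            d += pairs[i][1]
            removed += 1
            i += 1
        if d > 0:
            survival *= 1.0 - d / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical tenures in months; churn = 1 means the customer left, 0 = censored.
tenure = [2, 3, 3, 5, 8, 8, 12, 15]
churn = [1, 1, 0, 1, 0, 1, 1, 0]

km = kaplan_meier(tenure, churn)
for t, s in km:
    print(t, round(s, 3))
```

Censored cases never reduce the survival estimate directly; they only shrink the number at risk for later event times.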

SURVIVAL ANALYSIS
Example: telco.sav
As part of its efforts to reduce customer churn, a telecommunications company is interested in examining the "time to churn".

Variable name   Variable information
age             Age in years
marital         Marital status: 0 = unmarried, 1 = married
address         Years at current address
income          Household income in thousands
ed              Level of education: 1 = didn't complete high school, 2 = high school degree, 3 = college degree, 4 = undergraduate, 5 = postgraduate
employ          Years with current employer
reside          Number of people in household
gender          Gender: 0 = male, 1 = female
tenure          Months with service
churn           Churn within last month: 0 = No, 1 = Yes
custcat         Customer categories: 1 = basic service, 2 = E-service, 3 = plus service, 4 = total service

SURVIVAL ANALYSIS
Life table
Menu: Analyze > Survival > Life Tables


SURVIVAL ANALYSIS
Cox regression
Menu: Analyze > Survival > Cox Regression


MIXED LINEAR MODEL
The mixed linear model
- Factors. Categorical predictors should be selected as factors in the model. Each level of a factor can have a different linear effect on the value of the dependent variable.
  Fixed-effects factors are generally thought of as variables whose values of interest are all represented in the data file.
  Random-effects factors are variables whose values in the data file can be considered a random sample from a larger population of values. They are useful for explaining excess variability in the dependent variable.
- Covariates. Scale predictors should be selected as covariates in the model. Within combinations of factor levels (or cells), values of covariates are assumed to be linearly correlated with values of the dependent variables.
- Random effects covariance structure. The SPSS Mixed Linear Model procedure allows you to specify the relationship between the levels of random effects. By default, levels of random effects are uncorrelated and have the same variance (Univariate linear).

MIXED LINEAR MODEL
Repeated effects. These allow you to relax the assumption of independence of the error terms. In order to model the covariance structure of the error terms, you need to specify the following:
- Repeated effects variables are variables whose values in the data file can be considered as markers of multiple observations of a single subject.
- Subject variables define the individual subjects of the repeated measurements. The error terms for each individual are independent of those of other individuals.
- Covariance structure specifies the relationship between the levels of the repeated effects. The types of covariance structures available allow for residual terms with a wide variety of variances and covariances.
Hierarchical notation:
$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + r_{ij}$$
$$\beta_{0j} = \gamma_{00} + \gamma_{01} Z_j + u_{0j}$$
$$\beta_{1j} = \gamma_{10} + \gamma_{11} Z_j + u_{1j}$$
Mixed model notation:
$$Y_{ij} = \gamma_{00} + \gamma_{01} Z_j + \gamma_{10} X_{ij} + \gamma_{11} Z_j X_{ij} + u_{0j} + u_{1j} X_{ij} + r_{ij}$$

MIXED LINEAR MODEL
Using Mixed Linear Model to model random effects and repeated measures
Example: testmarket.sav
A fast food chain plans to add a new item to its menu. However, they are still undecided between three possible campaigns for promoting the new product. In order to determine which promotion has the greatest effect on sales, the new item is introduced at locations in several randomly selected markets. A different promotion is used at each location, and the weekly sales of the new item are recorded for the first four weeks.

Variable name   Variable information
marketid        Market ID
mktsize         Market size: 1 = small, 2 = medium, 3 = large
locid           Location ID
ageloc          Age of store location
promo           Promotion type
week            Week: 1, 2, 3, 4
sales           Units sold in thousands

MIXED LINEAR MODEL
Data structure for mixed models: (screenshot not reproduced)

MIXED LINEAR MODEL
Menu: Analyze > Mixed Models > Linear


TIME SERIES ANALYSIS
Definitions, Applications and Techniques
- Time series data: each case represents a point in time. Each cell gives the value of one variable for one time period.
- Stationarity: data are stationary when the mean, variance and autocorrelation structure do not change over time.
- Seasonality: periodic fluctuations.
Time series models are used:
- to obtain an understanding of the underlying forces and structure that produce the observed data;
- to fit a model and proceed to forecasting and monitoring.
Techniques:
- Exponential Smoothing
- ARIMA Models

TIME SERIES ANALYSIS
Exponential Smoothing
Four available model types:
- Simple. The simple model assumes that the series has no trend and no seasonal variation.
- Holt. The Holt model assumes that the series has a linear trend and no seasonal variation.
- Winters. The Winters model assumes that the series has a linear trend and multiplicative seasonal variation (its magnitude increases or decreases with the overall level of the series).
- Custom. A custom model allows you to specify the trend and seasonality components.
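The difference between the Simple and Holt models can be illustrated with a short Python sketch. It uses the textbook recursions with fixed smoothing weights, whereas SPSS estimates the weights from the data. The series, the weights alpha and gamma, and the function names are all hypothetical choices made for this example.

```python
def simple_smoothing(series, alpha):
    """Simple exponential smoothing: the smoothed level after each observation."""
    level = series[0]  # initialise the level at the first observation
    levels = [level]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels


def holt_forecast(series, alpha, gamma, steps):
    """Holt's linear-trend smoothing: forecasts for 1..steps periods ahead."""
    level = series[0]
    trend = series[1] - series[0]  # simple initial trend estimate
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = gamma * (level - previous_level) + (1 - gamma) * trend
    return [level + h * trend for h in range(1, steps + 1)]


# Hypothetical series with a steady upward trend and no seasonality.
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

final_level = simple_smoothing(sales, alpha=0.5)[-1]
holt = holt_forecast(sales, alpha=0.5, gamma=0.5, steps=2)

print("simple model, next-period forecast:", final_level)
print("Holt model, next two forecasts:", holt)
```

On a trending series the simple model's forecast lags behind the latest observation, while the Holt model extends the trend. This matches the assumptions listed for the two model types.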

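TIME SERIES ANALYSIS
ARIMA Model
ARIMA(p, d, q)(P, D, Q)
- Autoregression (AR): p is the order of autoregression
- Integration (I): d is the order of integration (differencing)
- Moving-Average (MA): q is the order of moving-average
- (P, D, Q) are their seasonal counterparts.
AR(p) model:
$$X_t = \varphi_1 X_{t-1} + \varphi_2 X_{t-2} + \dots + \varphi_p X_{t-p} + A_t$$
MA(q) model:
$$X_t = A_t + \theta_1 A_{t-1} + \theta_2 A_{t-2} + \dots + \theta_q A_{t-q}$$
ARIMA(p, d, q) model:
$$\Bigl(1 - \sum_{i=1}^{p} \varphi_i L^i\Bigr)(1 - L)^d X_t = \Bigl(1 + \sum_{i=1}^{q} \theta_i L^i\Bigr) A_t$$
Here A_t is the error (white noise) term and L is the lag operator, L X_t = X_{t-1}.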
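The integration order d corresponds to applying the difference operator (1 - L) to the series d times. The Python sketch below shows this on a hypothetical series with a quadratic trend. The function name difference is chosen here for illustration.

```python
def difference(series, d=1):
    """Apply (1 - L)^d: replace each value by its change from the previous value, d times."""
    for _ in range(d):
        series = [current - previous for previous, current in zip(series, series[1:])]
    return series


# Hypothetical series with a quadratic trend (not from catalog_seasfac.sav).
x = [1, 4, 9, 16, 25, 36]

first_difference = difference(x, d=1)   # still trending upward
second_difference = difference(x, d=2)  # constant, so the trend is removed

print(first_difference)
print(second_difference)
```

Each round of differencing shortens the series by one observation. In practice d is rarely larger than 1 or 2.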

TIME SERIES ANALYSIS
Example: catalog_seasfac.sav
A catalog company, interested in developing a forecasting model, has collected data on monthly sales of men's clothing along with several series that might be used to explain some of the variation in sales. Possible predictors include the number of catalogs mailed, the number of pages in the catalog, the number of phone lines open for ordering, the amount spent on print advertising, and the number of customer service representatives.

Variable name   Variable information
date            Date
men             Sales of men's clothing
mail            Number of catalogs mailed
page            Number of pages in catalog
phone           Number of phone lines open for ordering
print           Amount spent on print advertising
service         Number of customer service representatives

TIME SERIES ANALYSIS
Step 1: Draw a sequence chart to identify potential seasonality
Menu: Analyze > Forecasting > Sequence Charts

TIME SERIES ANALYSIS
Step 2: Build the model with the Expert Modeler
Menu: Analyze > Forecasting > Create Models


TIME SERIES ANALYSIS
Step 3: Make predictions by applying saved models
One way to make predictions is to save the predicted values and set the forecast period when constructing the model (recall Create Models).

TIME SERIES ANALYSIS
The other way to make predictions is to save the model structure and apply the saved model to data for forecasting.
Menu: Analyze > Forecasting > Apply Models


THANKS!
CAC statistical WIKI page: http://research2.smu.edu.sg/cac/statisticalcomputing/wiki/spss.aspx
Statistical consultation service: lsun@smu.edu.sg