Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Algorithms for Predictive Modelling
Contents:
- Regression
- Classification
- Auxiliary topics: estimation of prediction error, reduction of dimensionality

Predictive Modelling Terminology
Given input data {(x_1, y_1), ..., (x_n, y_n)}:
1. find a model of the relationship between Y and X_1, X_2, ..., X_d
2. estimate the predictive performance of the model for new data
The data form a table of n observations over the input columns X_1, X_2, ..., X_d and the target column Y.
X_i - input variable (other names: independent variable, predictor, regressor, explanatory variable, carrier, factor, covariate)
Y - target variable (also response)
[Figure: n x d data matrix with target column Y; the model predicts Y for a new observation]

Linear Methods for Regression
We seek Y(X) assuming that the relationship is linear (this assumption simplifies fitting the model).
Many nonlinear problems can be modelled with linear regression by applying transformations to the variables.
We will discuss the following problems:
- fitting the model to data
- verifying goodness of fit
- should the model include all the features X_1 through X_d, or only the best features? (especially important for high-dimensional data)

Linear Methods for Regression
Regression can also be used for classification: logistic regression.
The linearity assumption is also important in classification (e.g., Linear Discriminant Analysis (LDA), separating hyperplanes (the perceptron algorithm)).

Theoretical Background: Statistical Decision Theory
Notation:
X ∈ R^d - input variables (random variables)
Y ∈ R - output variable (random variable)
Pr(X, Y) - joint probability distribution
We look for a function f(x) for predicting the value of Y.

Theoretical Background: Statistical Decision Theory
Criterion: the function should minimize the expected squared error:
EPE(f) = E[(Y - f(X))^2]
Solution, the regression function:
f(x) = E(Y | X = x)

Theoretical Background: Statistical Decision Theory
Notice: if the criterion is instead to minimize the L_1 norm, E|Y - f(X)|, then the solution is the conditional median:
f(x) = median(Y | X = x)

From Theory to Practice
The regression function f(x) = E(Y | X = x) is based on the known joint probability distribution of X and Y.
How to estimate f from data {(x_1, y_1), ..., (x_n, y_n)}?
Different approaches:
- Parametric: assume a model form for f(x) (e.g., linear) and estimate its parameters
- Nonparametric: estimate f(x) directly from the data (e.g., nearest neighbours)

Linear Regression
We assume that f(x) = E(Y | X = x) is a linear function of X_1, X_2, ..., X_d:
f(X) = β_0 + β_1 X_1 + ... + β_d X_d
β - vector of unknown model coefficients.
As X_j we can take:
- quantitative input variables
- nonlinear transformations of inputs (e.g., log)
- polynomial terms, e.g., X_2 = X_1^2, etc.
- numerically coded levels of a qualitative variable

Fitting the Model
Estimation of β = [β_0, β_1, ..., β_d] based on {(x_1, y_1), ..., (x_n, y_n)} by minimization of the residual sum of squares RSS(β):
RSS(β) = Σ_{i=1..n} (y_i - β_0 - Σ_{j=1..d} β_j x_ij)^2
Solution (with X the n × (d+1) design matrix including the intercept column, and y the response vector):
β̂ = (X^T X)^{-1} X^T y
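The least-squares fit above can be sketched in a few lines. Below is a minimal Python/NumPy illustration (not the SAS procedure used later in these slides); the simulated data and its true coefficients are made up to mimic the weight example.

```python
import numpy as np

# Simulated data: weight as a linear function of height and age plus noise.
# The true coefficients (-127.8, 3.09, 2.4) are chosen to resemble the
# kids example from the slides; the data itself is synthetic.
rng = np.random.default_rng(0)
n = 200
height = rng.uniform(50, 70, n)
age = rng.uniform(5, 15, n)
weight = -127.8 + 3.09 * height + 2.4 * age + rng.normal(0, 10, n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), height, age])

# beta_hat = (X'X)^{-1} X'y, computed with a numerically stable
# least-squares solver rather than an explicit matrix inverse.
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)

# Residual sum of squares at the minimizer.
rss = np.sum((weight - X @ beta_hat) ** 2)
print(beta_hat, rss)
```

By construction the fitted RSS is no larger than the RSS of any other coefficient vector, including the true one.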

Verifying the Model
Goodness of fit must be verified before we attempt to use the model for prediction.
How to check if the model fits the data well? Regression procedures in software packages (SAS, SPSS, etc.) offer several tools for this: diagnostic plots, hypothesis tests of whether the parameters are significant, measures of residual error, etc. We illustrate these by example.

Linear Regression Example
How does the weight of children depend on height and age?
Solution using the SAS procedure reg:
proc reg data=kids;
  model weight=height age;        /* fit model w = f(h, a) */
  plot weight*age weight*height;
  plot r.*p.;                     /* diagnostic plots for model verification */
run;

Linear Regression Example
Solution: weight = -127.8 + 3.09 × height + 2.4 × age

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   -127.8199           12.09900        -10.56   <.0001
height     1   3.09005             0.25734         12.01    <.0001
age        1   2.40275             0.55103         4.36     <.0001

Linear Regression Example
Overall measures concerning the fit:
Root MSE - root mean square error of the regression
R-Square - 63% of the variance in the data is explained by the regression model (a value close to 1 implies that the model is appropriate)

Root MSE        11.86836    R-Square  0.6305
Dependent Mean  101.30802   Adj R-Sq  0.6273
Coeff Var       11.71512
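As a quick check, the reported Adj R-Sq can be reproduced from R-Square and the degrees of freedom; here n = 237 is taken from the Corrected Total DF = 236 in the ANOVA table shown later in this example, and p = 2 is the number of predictors (height, age).

```python
# Adjusted R-square penalizes R-square for the number of predictors:
# Adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 = 0.6305   # R-Square from the SAS output
n = 237       # observations: Corrected Total DF = 236, so n = 236 + 1
p = 2         # predictors: height and age

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))  # 0.6273, matching the reported Adj R-Sq
```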

Linear Regression Example
Testing if the parameters of the model are significant:
t Value - test of the hypothesis H0: β_j = 0
p-value < 0.05 - reject H0; the model parameter is significant

Parameter Estimates
Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  1   -127.8199           12.09900        -10.56   <.0001
height     1   3.09005             0.25734         12.01    <.0001
age        1   2.40275             0.55103         4.36     <.0001
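Each t value in the table is simply the parameter estimate divided by its standard error; a quick arithmetic check against the SAS output above:

```python
# Estimates and standard errors copied from the Parameter Estimates table.
params = {
    "Intercept": (-127.8199, 12.09900),
    "height":    (3.09005, 0.25734),
    "age":       (2.40275, 0.55103),
}

# t = estimate / standard error, rounded as in the SAS output.
t_values = {name: round(est / se, 2) for name, (est, se) in params.items()}
print(t_values)  # {'Intercept': -10.56, 'height': 12.01, 'age': 4.36}
```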

Linear Regression Example
Diagnostic plots visually verify the linearity assumption: weight vs age.

Linear Regression Example
Diagnostic plots visually verify the linearity assumption: weight vs height.

Linear Regression Example
Diagnostic plots: residual vs predicted values.
Residual = y_observed - y_predicted
A trend or shape in this plot may indicate model inadequacy.

Linear Regression Finding Best Regressors
Problem: should the regression be based on all features of X, or only on the best subset of features?
proc reg data=kids;
  model weight=height age / selection=forward;
  model weight=height age / selection=backward;
  model weight=height age / selection=stepwise;
run;
Forward - parameters are added to the model one at a time
Backward - starting from the complete model, parameters are removed one at a time
Stepwise - similar to forward, but a parameter can be removed after being added to the model
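Forward selection can be sketched as a greedy loop that repeatedly adds the variable giving the largest drop in RSS. The Python sketch below is a simplified stand-in for PROC REG's selection=forward (the real procedure uses F-test entry and stopping criteria), run on synthetic data shaped like the kids example.

```python
import numpy as np

def forward_selection(X, y, names):
    """Greedy forward selection: at each step, add the variable whose
    inclusion yields the smallest residual sum of squares.
    Returns variable names in the order they entered the model."""
    n = len(y)
    selected, remaining, order = [], list(range(X.shape[1])), []
    while remaining:
        best_j, best_rss = None, None
        for j in remaining:
            cols = selected + [j]
            # Candidate design matrix: intercept plus chosen columns.
            A = np.column_stack([np.ones(n)] + [X[:, c] for c in cols])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if best_rss is None or rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
        order.append(names[best_j])
    return order

# Synthetic data mimicking the kids example (coefficients are made up).
rng = np.random.default_rng(1)
n = 237
height = rng.uniform(50, 70, n)
age = rng.uniform(5, 15, n)
weight = -127.8 + 3.09 * height + 2.4 * age + rng.normal(0, 12, n)

entry_order = forward_selection(np.column_stack([height, age]),
                                weight, ["height", "age"])
print(entry_order)  # height enters first, as in the SAS output
```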

Linear Regression Finding Best Regressors
Forward Selection: Step 1 - variable height entered: R-Square = 0.6004 and C(p) = 20.0137

Analysis of Variance
Source           DF   Sum of Squares  Mean Square  F Value  Pr > F
Model            1    53555           53555        353.14   <.0001
Error            235  35639           151.65526
Corrected Total  236  89194
(The F test checks H0 that the smaller model is appropriate.)

Variable   Parameter Estimate  Standard Error  Type II SS  F Value  Pr > F
Intercept  -132.99101          12.49370        17184       113.31   <.0001
height     3.81815             0.20318         53555       353.14   <.0001

Linear Regression Finding Best Regressors
Forward Selection: Step 2 - variable age entered: R-Square = 0.6305 and C(p) = 3.0000

Analysis of Variance
Source           DF   Sum of Squares  Mean Square  F Value  Pr > F
Model            2    56233           28117        199.61   <.0001
Error            234  32961           140.85795
Corrected Total  236  89194

Variable   Parameter Estimate  Standard Error  Type II SS  F Value  Pr > F
Intercept  -127.81991          12.09900        15721       111.61   <.0001
height     3.09005             0.25734         20309       144.18   <.0001
age        2.40275             0.55103         2678.22612  19.01    <.0001

Regression vs nonparametric approach
Linear regression: f(x) = E(Y | X = x) under the assumption that f is a linear function of X_1, ..., X_d.
Nearest neighbours method:
N_k(x) - the neighbourhood of x containing the k points closest to x
f̂(x) = average of y_i over the points x_i in N_k(x), an attempt to estimate f(x) = E(Y | X = x) directly from the data.
Theorem. If n → ∞, k → ∞, and k/n → 0, then f̂(x) → E(Y | X = x).
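The nearest-neighbours estimate described above can be written directly: average the responses of the k training points closest to x. A minimal Python/NumPy sketch (the helper name and toy data are illustrative, not from the slides):

```python
import numpy as np

def knn_regress(x0, X, y, k):
    """Estimate f(x0) = E(Y | X = x0) by averaging y over the k
    training points nearest to x0 (Euclidean distance)."""
    dist = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(dist)[:k]
    return y[nearest].mean()

# Toy training data: y = 2x on a small grid.
X_train = np.array([[0.0], [0.25], [0.5], [0.75], [1.0]])
y_train = 2 * X_train[:, 0]

# The k = 3 neighbours of x0 = 0.5 are 0.25, 0.5, 0.75,
# so the estimate is mean(0.5, 1.0, 1.5).
estimate = knn_regress(np.array([0.5]), X_train, y_train, k=3)
print(estimate)  # 1.0
```

Larger k averages over a wider neighbourhood (less variance, more bias); the theorem above says the estimate converges to E(Y | X = x) when k grows more slowly than n.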