Data Mining Part 5. Prediction



Similar documents
Data Mining Part 5. Prediction

Tutorial for the TI-89 Titanium Calculator

Data Mining III: Numeric Estimation

Predictive Dynamix Inc

Scatter Plot, Correlation, and Regression on the TI-83/84

The importance of graphing the data: Anscombe s regression examples

Chapter 12 Discovering New Knowledge Data Mining

Slope-Intercept Equation. Example

Data Mining 5. Cluster Analysis

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Example: Boats and Manatees

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

DATA MINING TECHNIQUES AND APPLICATIONS

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

Tutorial on Using Excel Solver to Analyze Spin-Lattice Relaxation Time Data

Binary Logistic Regression

Worksheet A5: Slope Intercept Form

Studying Auto Insurance Data

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Chapter 7: Simple linear regression Learning Objectives

CHAPTER 13 SIMPLE LINEAR REGRESSION. Opening Example. Simple Regression. Linear Regression

Pennsylvania System of School Assessment

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Classification and Prediction

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Decision Trees from large Databases: SLIQ

Diagrams and Graphs of Statistical Data

Data Mining. SPSS Clementine Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine

Dealing with Data in Excel 2010

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

Aim: How do we find the slope of a line? Warm Up: Go over test. A. Slope -

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Data Mining for Knowledge Management. Classification

APPLICATION OF DATA MINING TECHNIQUES FOR BUILDING SIMULATION PERFORMANCE PREDICTION ANALYSIS.

Econometrics Simple Linear Regression

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

You buy a TV for $1000 and pay it off with $100 every week. The table below shows the amount of money you sll owe every week. Week

X X X a) perfect linear correlation b) no correlation c) positive correlation (r = 1) (r = 0) (0 < r < 1)

not possible or was possible at a high cost for collecting the data.

SAS Software to Fit the Generalized Linear Model

Customer Classification And Prediction Based On Data Mining Technique

Lecture 11: Chapter 5, Section 3 Relationships between Two Quantitative Variables; Correlation

Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016

Indiana State Core Curriculum Standards updated 2009 Algebra I

Data Mining. Nonlinear Classification

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

Gerry Hobbs, Department of Statistics, West Virginia University

Data Mining: STATISTICA

Calibration and Linear Regression Analysis: A Self-Guided Tutorial

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Correlational Research. Correlational Research. Stephen E. Brock, Ph.D., NCSP EDS 250. Descriptive Research 1. Correlational Research: Scatter Plots

Predictive Analytics: Extracts from Red Olive foundational course

Mario Guarracino. Regression

Florida Math for College Readiness

Data Mining - Evaluation of Classifiers

Efficiency in Software Development Projects

GRADES 7, 8, AND 9 BIG IDEAS

Social Media Mining. Data Mining Essentials

with functions, expressions and equations which follow in units 3 and 4.

Machine Learning Logistic Regression

Common Core Unit Summary Grades 6 to 8

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Spreadsheets and Laboratory Data Analysis: Excel 2003 Version (Excel 2007 is only slightly different)

Part 1: Background - Graphing

Coordinate Plane, Slope, and Lines Long-Term Memory Review Review 1

4. Simple regression. QBUS6840 Predictive Analytics.

Exercise 1.12 (Pg )

Algebra 1 Course Information

Audit Analytics. --An innovative course at Rutgers. Qi Liu. Roman Chinchila

The Predictive Data Mining Revolution in Scorecards:

Curve Fitting in Microsoft Excel By William Lee

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Vertical Alignment Colorado Academic Standards 6 th - 7 th - 8 th

A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE

Temperature Scales. The metric system that we are now using includes a unit that is specific for the representation of measured temperatures.

Data Mining Classification: Decision Trees

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Designer: Nathan Kimball. Stage 1 Desired Results

Dimensionality Reduction: Principal Components Analysis

CS Introduction to Data Mining Instructor: Abdullah Mueen

Predicting Student Persistence Using Data Mining and Statistical Analysis Methods

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

What does the number m in y = mx + b measure? To find out, suppose (x 1, y 1 ) and (x 2, y 2 ) are two points on the graph of y = mx + b.

Correlation key concepts:

Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.

Data Mining Practical Machine Learning Tools and Techniques

Part Three. Cost Behavior Analysis

Section 1.5 Linear Models

Comparison of Data Mining Techniques used for Financial Data Analysis

Benchmarking of different classes of models used for credit scoring

The Artificial Prediction Market

Definition 8.1 Two inequalities are equivalent if they have the same solution set. Add or Subtract the same value on both sides of the inequality.

MTH 140 Statistics Videos

430 Statistics and Financial Mathematics for Business

Sample lab procedure and report. The Simple Pendulum

The Method of Least Squares

Lecture 3: Linear methods for classification

Data quality in Accounting Information Systems

Transcription:

Data Mining Part 5. Prediction 5.7 Spring 2010 Instructor: Dr. Masoud Yaghini

Outline Introduction Linear Regression Other Regression Models References

Introduction

Introduction Numerical prediction is similar to classification construct a model use model to predict continuous or ordered value for a given input Numeric prediction vs. classification Classification refers to predict categorical class label Numeric prediction models continuous-valued functions

Introduction Regression analysis is the major method for numeric prediction Regression analysis model the relationship between one or more independent or predictor variables and a dependent or response variable Regression analysis is a good choice when all of the predictor variables are continuous valued as well.

Introduction In the context of data mining The predictor variables are the attributes of interest describing the instance that are known. The response variable is what we want to predict Some classification techniques can be adapted for prediction, e.g. Backpropagation k-nearest-neighbor classifiers Support vector machines

Introduction Regression analysis methods: Linear regression Straight-line linear regression Multiple linear regression Non-linear regression Generalized linear model Poisson regression Logistic regression Log-linear models Regression trees and Model trees

Linear Regression

Linear Regression Straight-line linear regression: involves a response variable y and a single predictor variable x y = w 0 + w 1 x w 0 : y-intercept w 1 : slope w 0 & w 1 are regression coefficients

Linear regression Method of least squares: estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line. D: a training set x: values of predictor variable y: values of response variable D : data points of the form(x1, y1), (x2, y2),, (x D, y D ). : the mean value of x1, x2,, x D : the mean value of y1, y2,, y D

Example: Salary problem The table shows a set of paired data where x is the number of years of work experience of a college graduate and y is the corresponding salary of the graduate.

Linear Regression The 2-D data can be graphed on a scatter plot. The plot suggests a linear relationship between the two variables, x and y.

Example: Salary data Given the above data, we compute we get The equation of the least squares line is estimated by

Example: Salary data Linear Regression: Y=3.5*X+23.2 120 100 80 Salary 60 40 20 0 0 5 10 15 20 25 Years

Multiple linear regression Multiple linear regression involves more than one predictor variable Training data is of the form (X 1, y 1 ), (X 2, y 2 ),, (X D, y D ) where the X i are the n-dimensional training data with associated class labels, y i An example of a multiple linear regression model based on two predictor attributes:

Example: CPU performance data

Multiple Linear Regression Various statistical measures exist for determining how well the proposed model can predict y (described later). Obviously, the greater the number of predictor attributes is, the slower the performance is. Before applying regression analysis, it is common to perform attribute subset selection to eliminate attributes that are unlikely to be good predictors for y. In general, regression analysis is accurate for numeric prediction, except when the data contain outliers.

Other Regression Models

Nonlinear Regression If data that does not show a linear dependence we can get a more accurate model using a nonlinear regression model For example, y = w 0 + w 1 x + w 2 x 2 + w 3 x 3

Generalized linear models Generalized linear model is foundation on which linear regression can be applied to modeling categorical response variables Common types of generalized linear models include Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables. Poisson regression: models the data that exhibit a Poisson distribution

Log-linear models In the log-linear method, all attributes must be categorical Continuous-valued attributes must first be discretized.

Regression Trees and Model Trees Trees to predict continuous values rather than class labels Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model

Regression trees Regression tree: a decision tree where each leaf predicts a numeric quantity Proposed in CART system (Breiman et al. 1984) CART: Classification And Regression Trees Predicted value is average value of training instances that reach the leaf

Example: CPU performance problem Regression tree for the CPU data

Example: CPU performance problem We calculate the average of the absolute values of the errors between the predicted and the actual CPU performance measures It turns out to be significantly less for the tree than for the regression equation.

Model tree Model tree: Each leaf holds a regression model A multivariate linear equation for the predicted attribute Proposed by Quinlan (1992) A more general case than regression tree

Example: CPU performance problem Model tree for the CPU data

References

References J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier Inc. (2006). (Chapter 6) I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Elsevier Inc., 2005. (Chapter 6)

The end