Chapter 6. Classification and Prediction
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Classification by backpropagation
- Lazy learners (or learning from your neighbors)
- Frequent-pattern-based classification
- Other classification methods
- Prediction
- Accuracy and error measures
October 25, 2013, Data Mining: Concepts and Techniques
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - Credit/loan approval
  - Medical diagnosis: is a tumor cancerous or benign?
  - Fraud detection: is a transaction fraudulent?
  - Web page categorization: which category does a page belong to?
Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
Process (1): Model Construction
Training data:
  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no
A classification algorithm produces the classifier (model):
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Process (2): Using the Model in Prediction
Testing data:
  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes
Unseen data: (Jeff, Professor, 4) -> Tenured?
Issues: Data Preparation
- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
Issues: Evaluating Classification Methods
- Accuracy
  - classifier accuracy: predicting the class label
  - predictor accuracy: guessing the value of the predicted attribute
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Decision Tree Induction: Training Dataset
This follows an example of Quinlan's ID3 (Playing Tennis):
  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31..40  high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31..40  low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31..40  medium  no       excellent      yes
  31..40  high    yes      fair           yes
  >40     medium  no       excellent      no
Output: A Decision Tree for "buys_computer"
  age?
    <=30   -> student?
                no  -> no
                yes -> yes
    31..40 -> yes
    >40    -> credit_rating?
                excellent -> no
                fair      -> yes
Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
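The greedy procedure above can be sketched in Python (an illustrative sketch, not code from the book; the dictionary-based tree representation and the helper names are my own):

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information needed to classify a sample with these labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy reduction obtained by partitioning the rows on one attribute."""
    n = len(labels)
    after = 0.0
    for v in set(r[attr] for r in rows):
        sub = [l for r, l in zip(rows, labels) if r[attr] == v]
        after += len(sub) / n * entropy(sub)
    return entropy(labels) - after

def build_tree(rows, labels, attrs):
    # Stop: all samples at this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting at the leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: split on the attribute with the highest information gain
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node[(best, v)] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return node
```

Internal nodes are dictionaries keyed by (attribute, value); leaves are class labels.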
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
- Expected information (entropy) needed to classify a tuple in D:
    Info(D) = - sum_{i=1..m} p_i log2(p_i)
- Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = sum_{j=1..v} (|D_j| / |D|) * Info(D_j)
- Information gained by branching on attribute A:
    Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
- Partitioning on age (counts from the training data above):
    age     p_i  n_i  I(p_i, n_i)
    <=30    2    3    0.971
    31..40  4    0    0
    >40     3    2    0.971
    Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  Here (5/14) I(2,3) means age <=30 has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence
    Gain(age) = Info(D) - Info_age(D) = 0.246
- Similarly,
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
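The slide's numbers can be reproduced in a few lines of Python (a sketch; the helper name `info` is my own):

```python
import math

def info(counts):
    """Expected information I(p, n, ...) for a tuple of per-class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# (yes, no) counts overall and per age partition, from the training data above
info_D = info((9, 5))                                              # ~0.940
info_age = (5/14)*info((2, 3)) + (4/14)*info((4, 0)) + (5/14)*info((3, 2))  # ~0.694
gain_age = info_D - info_age                                       # ~0.246
```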
Computing Information Gain for Continuous-Valued Attributes
- Let attribute A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    - (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
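Generating the candidate split points is a one-liner over the sorted distinct values (an illustrative sketch; the function name is my own):

```python
def candidate_split_points(values):
    """Midpoints between adjacent sorted values of a continuous attribute."""
    vs = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(vs, vs[1:])]
```

Each midpoint would then be scored with Info_A(D) over the resulting D1/D2 split, keeping the one with the minimum expected information requirement.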
Gain Ratio for Attribute Selection (C4.5)
- The information gain measure is biased towards attributes with a large number of values
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
    SplitInfo_A(D) = - sum_{j=1..v} (|D_j| / |D|) log2(|D_j| / |D|)
    GainRatio(A) = Gain(A) / SplitInfo_A(D)
- Ex. income splits D into 4 "high", 6 "medium", and 4 "low" tuples:
    SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557
    gain_ratio(income) = 0.029 / 1.557 = 0.019
- The attribute with the maximum gain ratio is selected as the splitting attribute
Gini Index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 - sum_{j=1..n} p_j^2
  where p_j is the relative frequency of class j in D
- If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
    gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
- Reduction in impurity:
    Δgini(A) = gini(D) - gini_A(D)
- The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all possible splitting points for each attribute)
Gini Index (CART, IBM IntelligentMiner)
- Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
    gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
    gini_{income in {low, medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443
  The other binary splits give 0.458 ({low, high} and {medium}) and 0.450 ({medium, high} and {low}), so the split on {low, medium} (equivalently {high}) is the best since it gives the lowest gini index
- All attributes are assumed continuous-valued
- May need other tools, e.g., clustering, to get the possible split values
- Can be modified for categorical attributes
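These values follow directly from the class counts in each partition (a sketch; the per-partition (yes, no) counts are read off the training data on the earlier slide, and the helper name is my own):

```python
def gini(counts):
    """Gini index 1 - sum_j p_j^2 for a tuple of per-class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

gini_D = gini((9, 5))                                       # ~0.459
# Binary split on income: D1 = {low, medium} has 7 yes / 3 no,
# D2 = {high} has 2 yes / 2 no
gini_split = (10/14) * gini((7, 3)) + (4/14) * gini((2, 2)) # ~0.443
```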
Comparing Attribute Selection Measures
- The three measures, in general, return good results, but:
  - Information gain: biased towards multivalued attributes
  - Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others
  - Gini index:
    - biased towards multivalued attributes
    - has difficulty when the number of classes is large
    - tends to favor tests that result in equal-sized partitions and purity in both partitions
Overfitting and Tree Pruning
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree; get a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"
Classification in Large Databases
- Classification: a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods
Classification by Backpropagation
- Backpropagation: a neural network learning algorithm
- Started by psychologists and neurobiologists to develop and test computational analogues of neurons
- A neural network: a set of connected input/output units where each connection has a weight associated with it
- During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
- Also referred to as connectionist learning due to the connections between units
Neural Network as a Classifier
- Weakness
  - Long training time
  - Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
  - Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
- Strength
  - High tolerance to noisy data
  - Ability to classify untrained patterns
  - Well-suited for continuous-valued inputs and outputs
  - Successful on a wide array of real-world data
  - Algorithms are inherently parallel
  - Techniques have recently been developed for the extraction of rules from trained neural networks
A Neuron (= a perceptron)
- Input vector x = (x_0, ..., x_n), weight vector w = (w_0, ..., w_n), bias -μ_k, weighted sum, activation function f, output y
- For example:
    y = sign( sum_{i=0..n} w_i x_i - μ_k )
- The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
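The unit's computation translates directly into code (a sketch; I take sign as +1/-1 with a zero sum mapped to +1):

```python
def perceptron_output(w, x, mu):
    """y = sign(sum_i w_i * x_i - mu): weighted sum, then hard-threshold activation."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu
    return 1 if s >= 0 else -1
```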
A Multi-Layer Feed-Forward Neural Network
- Layers: input vector X -> input layer -> hidden layer (weights w_ij) -> output layer -> output vector
- Weight update:
    w_j^(k+1) = w_j^(k) + λ (y_i - ŷ_i^(k)) x_ij
How Does a Multi-Layer Neural Network Work?
- The inputs to the network correspond to the attributes measured for each training tuple
- Inputs are fed simultaneously into the units making up the input layer
- They are then weighted and fed simultaneously to a hidden layer
- The number of hidden layers is arbitrary, although usually only one is used
- The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction
- The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
- From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function
Defining a Network Topology
- First decide the network topology: the number of units in the input layer, the number of hidden layers (if more than one), the number of units in each hidden layer, and the number of units in the output layer
- Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
- One input unit per domain value, each initialized to 0
- Output: for classification with more than two classes, one output unit per class is used
- Once a network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
Backpropagation
- Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
- For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value
- Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation"
- Steps
  - Initialize weights (to small random numbers) and biases in the network
  - Propagate the inputs forward (by applying the activation function)
  - Backpropagate the error (by updating weights and biases)
  - Terminating condition (when the error is very small, etc.)
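The forward and backward steps can be sketched for a one-hidden-layer network with sigmoid units (an illustrative sketch, not the book's pseudocode; the delta rules for sigmoid activations under squared error are standard, but the function layout and names are my own):

```python
import math

def sigmoid(v):
    return 1 / (1 + math.exp(-v))

def train_step(x, target, W1, b1, W2, b2, lr=0.5):
    """One forward + backward pass for a 1-hidden-layer network (squared error)."""
    # Propagate the inputs forward through hidden layer and single output unit
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
         for ws, b in zip(W1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    # Backpropagate the error: output unit first, then hidden units
    err_o = o * (1 - o) * (target - o)
    err_h = [hi * (1 - hi) * W2[j] * err_o for j, hi in enumerate(h)]
    # Update weights and biases in the backwards direction
    W2 = [w + lr * err_o * hi for w, hi in zip(W2, h)]
    b2 = b2 + lr * err_o
    W1 = [[w + lr * err_h[j] * xi for w, xi in zip(W1[j], x)]
          for j in range(len(W1))]
    b1 = [b + lr * e for b, e in zip(b1, err_h)]
    return W1, b1, W2, b2, o
```

Calling `train_step` repeatedly on the training tuples, until the error is very small, is the outer loop of the algorithm.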
Backpropagation and Interpretability
- Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| * w) time, with |D| tuples and w weights, but the number of epochs can be exponential in n, the number of inputs, in the worst case
- Rule extraction from networks: network pruning
  - Simplify the network structure by removing weighted links that have the least effect on the trained network
  - Then perform link, unit, or activation value clustering
  - The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
- Sensitivity analysis: assess the impact that a given input variable has on a network output; the knowledge gained from this analysis can be represented in rules
Lazy vs. Eager Learning
- Lazy vs. eager learning
  - Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  - Eager learning (the methods discussed above): given a training set, constructs a classification model before receiving new (e.g., test) data to classify
- Lazy: less time in training but more time in predicting
- Accuracy
  - A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function
  - Eager: must commit to a single hypothesis that covers the entire instance space
Lazy Learner: Instance-Based Methods
- Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
- Typical approaches
  - k-nearest neighbor approach: instances represented as points in a Euclidean space
  - Locally weighted regression: constructs a local approximation
  - Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-nearest Neghbor Algorthm All nstances correspond to ponts n the n-d space The nearest neghbor are defned n terms of Eucldean dstance, dst(x 1, X 2 ) Target functon could be dscrete- or real- valued For dscrete-valued, k-nn returns the most common value among the k tranng examples nearest to x q Vonoro dagram: the decson surface nduced by 1- NN for a typcal set of tranng examples _ + _. + _ + x q _ + October 25, 2013 Data Mnng: Concepts and Technques 31.....
Dscusson on the k-nn Algorthm k-nn for real-valued predcton for a gven unknown tuple Returns the mean values of the k nearest neghbors Dstance-weghted nearest neghbor algorthm Weght the contrbuton of each of the k neghbors accordng to ther dstance to the query x q w Gve greater weght to closer neghbors d( x q Robust to nosy data by averagng k-nearest neghbors 1, x ) 2 Curse of dmensonalty: dstance between neghbors could be domnated by rrelevant attrbutes To overcome t, axes stretch or elmnaton of the least relevant attrbutes October 25, 2013 Data Mnng: Concepts and Technques 32
Genetic Algorithms (GA)
- Genetic algorithm: based on an analogy to biological evolution
- An initial population is created consisting of randomly generated rules
  - Each rule is represented by a string of bits
  - E.g., "IF A1 AND NOT A2 THEN C2" can be encoded as "100"
  - If an attribute has k > 2 values, k bits can be used
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring
- The fitness of a rule is represented by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation
- The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold
- Slow but easily parallelizable
What Is Prediction?
- (Numerical) prediction is similar to classification
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification
  - Classification refers to predicting a categorical class label
  - Prediction models continuous-valued functions
- Major method for prediction: regression
  - model the relationship between one or more independent or predictor variables and a dependent or response variable
- Regression analysis
  - Linear and multiple regression
  - Nonlinear regression
  - Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
- Linear regression: involves a response variable y and a single predictor variable x:
    y = w0 + w1 * x
  where w0 (y-intercept) and w1 (slope) are regression coefficients
- Method of least squares: estimates the best-fitting straight line:
    w1 = sum_{i=1..|D|} (x_i - x̄)(y_i - ȳ) / sum_{i=1..|D|} (x_i - x̄)^2
    w0 = ȳ - w1 * x̄
- Multiple linear regression: involves more than one predictor variable
  - Training data is of the form (X_1, y_1), (X_2, y_2), ..., (X_|D|, y_|D|)
  - Ex. for 2-D data, we may have: y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least squares method or using software such as SAS or S-Plus
  - Many nonlinear functions can be transformed into the above
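The least-squares estimates translate directly into code (a sketch):

```python
def linear_regression(xs, ys):
    """Least-squares fit of y = w0 + w1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # w1 = sum (x_i - x_bar)(y_i - y_bar) / sum (x_i - x_bar)^2
    w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    w0 = y_bar - w1 * x_bar
    return w0, w1
```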
Nonlinear Regression
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model. For example,
    y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:
    y = w0 + w1 x + w2 x2 + w3 x3
- Other functions, such as the power function, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms)
  - possible to obtain least squares estimates through extensive calculation on more complex formulae
Other Regression-Based Models
- Generalized linear model
  - Foundation on which linear regression can be applied to the modeling of categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees
  - Trees to predict continuous values rather than class labels
Prediction: Numerical Data
Prediction: Categorical Data
Classifier Accuracy Measures
- Confusion matrix:
    Real class \ Predicted class   C1               ~C1
    C1                             true positives   false negatives
    ~C1                            false positives  true negatives
- Example:
    Real class \ Predicted class  buy_computer=yes  buy_computer=no  total  recognition(%)
    buy_computer = yes            6954              46               7000   99.34
    buy_computer = no             412               2588             3000   86.27
    total                         7366              2634             10000  95.42
- Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M
- Error rate (misclassification rate) of M = 1 - acc(M)
- Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j
- Alternative accuracy measures (e.g., for cancer diagnosis):
    sensitivity = t-pos / pos        /* true positive recognition rate */
    specificity = t-neg / neg        /* true negative recognition rate */
    precision = t-pos / (t-pos + f-pos)
    accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
- This model can also be used for cost-benefit analysis
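Plugging the example confusion matrix into these definitions (a sketch; the variable names follow the slide's t-pos/t-neg notation):

```python
t_pos, f_neg = 6954, 46    # actual buy_computer = yes (pos = 7000)
f_pos, t_neg = 412, 2588   # actual buy_computer = no  (neg = 3000)
pos, neg = t_pos + f_neg, f_pos + t_neg

sensitivity = t_pos / pos                 # true positive recognition rate
specificity = t_neg / neg                 # true negative recognition rate
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
```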
Predictor Error Measures
- Measure predictor accuracy: how far off the predicted value is from the actual known value
- Loss function: measures the error between y_i and the predicted value y_i'
  - Absolute error: |y_i - y_i'|
  - Squared error: (y_i - y_i')^2
- Test error (generalization error): the average loss over the test set
  - Mean absolute error:     sum_{i=1..d} |y_i - y_i'| / d
  - Mean squared error:      sum_{i=1..d} (y_i - y_i')^2 / d
  - Relative absolute error: sum_{i=1..d} |y_i - y_i'| / sum_{i=1..d} |y_i - ȳ|
  - Relative squared error:  sum_{i=1..d} (y_i - y_i')^2 / sum_{i=1..d} (y_i - ȳ)^2
- The mean squared error exaggerates the presence of outliers
- Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
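The four test-error measures as code (a sketch; `actual` holds the known values y_i and `pred` the predicted values y_i'):

```python
def error_measures(actual, pred):
    """Return (MAE, MSE, relative absolute error, relative squared error)."""
    d = len(actual)
    y_bar = sum(actual) / d
    mae = sum(abs(y - yp) for y, yp in zip(actual, pred)) / d
    mse = sum((y - yp) ** 2 for y, yp in zip(actual, pred)) / d
    # Relative measures normalize by the error of always predicting the mean
    rae = (sum(abs(y - yp) for y, yp in zip(actual, pred))
           / sum(abs(y - y_bar) for y in actual))
    rse = (sum((y - yp) ** 2 for y, yp in zip(actual, pred))
           / sum((y - y_bar) ** 2 for y in actual))
    return mae, mse, rae, rse
```

Taking the square root of `mse` (or `rse`) gives the root mean squared (or root relative squared) error mentioned above.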
Evaluating the Accuracy of a Classifier or Predictor (I)
- Holdout method
  - The given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Test set (e.g., 1/3) for accuracy estimation
  - Random sampling: a variation of holdout
    - Repeat holdout k times; accuracy = avg. of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most popular)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the test set and the others as the training set
  - Leave-one-out: k folds where k = the number of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
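The k-fold partitioning and evaluation loop can be sketched as follows (illustrative; `train_and_test` is a hypothetical callback that builds a model on the training indices and returns its accuracy on the test indices):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k roughly equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, train_and_test, k=10):
    """At the i-th iteration, fold i is the test set and the rest the training set."""
    accs = []
    for fold in k_fold_indices(len(data), k):
        test = set(fold)
        train = [i for i in range(len(data)) if i not in test]
        accs.append(train_and_test(train, sorted(test)))
    return sum(accs) / k
```

A stratified variant would shuffle and slice the indices of each class separately so every fold preserves the class distribution.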
Model Selection: ROC Curves
- ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
- Originated from signal detection theory
- Shows the trade-off between the true positive rate and the false positive rate
- The area under the ROC curve is a measure of the accuracy of the model
- Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
- The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
- The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate
- The plot also shows a diagonal line
- A model with perfect accuracy will have an area of 1.0
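Ranking the tuples by decreasing score, tracing the curve, and integrating with the trapezoidal rule can be sketched as follows (illustrative; labels are 1 for the positive class and 0 for the negative class):

```python
def roc_points(scores, labels):
    """Rank tuples by decreasing score; trace (FPR, TPR) as the threshold drops."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, lab in ranked:
        if lab:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))
```

A perfect ranking (all positives scored above all negatives) yields an area of 1.0, while a random ranking stays near the diagonal and an area of about 0.5.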