Classification Techniques (1)



Similar documents
Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Data Mining Classification: Decision Trees

Social Media Mining. Data Mining Essentials

Knowledge Discovery and Data Mining

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Data Mining for Knowledge Management. Classification

Classification algorithm in Data mining: An Overview

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Data Mining Individual Assignment report

Classification and Prediction

An Overview of Knowledge Discovery Database and Data mining Techniques

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Foundations of Artificial Intelligence. Introduction to Data Mining

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

Review: Classification Outline

: Introduction to Machine Learning Dr. Rita Osadchy

Machine Learning using MapReduce

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Distances, Clustering, and Classification. Heatmaps

DATA MINING TECHNIQUES AND APPLICATIONS

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Quick Introduction of Data Mining Techniques

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Cluster Analysis: Basic Concepts and Algorithms

Data Mining Part 5. Prediction

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Practical Introduction to Machine Learning and Optimization. Alessio Signorini

Lecture 6 - Data Mining Processes

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Data Preprocessing. Week 2

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Segmentation of stock trading customers according to potential value

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Unsupervised Learning and Data Mining. Unsupervised Learning and Data Mining. Clustering. Supervised Learning. Supervised Learning

Unsupervised learning: Clustering

Lecture 10: Regression Trees

Machine Learning in Spam Filtering

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

A Comparative Analysis of Classification Techniques on Categorical Data in Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. Lecture Notes for Chapter 9. Introduction to Data Mining

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Introduction to Data Mining

Customer Classification And Prediction Based On Data Mining Technique

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

How To Cluster

Chapter 7. Cluster Analysis

Data Mining: A Preprocessing Engine

Automated News Item Categorization

Model Selection. Introduction. Model Selection

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

Professor Anita Wasilewska. Classification Lecture Notes

Data Mining 5. Cluster Analysis

Data Mining. Nonlinear Classification

Chapter 6. The stacking ensemble approach

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Data Mining Techniques Chapter 6: Decision Trees

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Local classification and local likelihoods

Course 395: Machine Learning

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Chapter 12 Discovering New Knowledge Data Mining

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

Comparison of Classification Techniques for Heart Health Analysis System

Classification and Prediction techniques using Machine Learning for Anomaly Detection.

Data Mining - Evaluation of Classifiers

Applying Data Mining to Demand Forecasting and Product Allocations

Data Mining. Toon Calders

Data Mining Essentials

Knowledge Discovery and Data Mining

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

K-Means Clustering Tutorial

Data Mining: Foundation, Techniques and Applications

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

1 What is Machine Learning?

Maschinelles Lernen mit MATLAB

Learning is a very general term denoting the way in which agents:

Data Mining: An Overview. David Madigan

Using multiple models: Bagging, Boosting, Ensembles, Forests

Data Mining Techniques

Predicting Student Performance by Using Data Mining Methods for Classification

Numerical Algorithms Group

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Practical Machine Learning Tools and Techniques

Adaptive Anomaly Detection for Network Security

OUTLIER ANALYSIS. Data Mining 1

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Advanced Ensemble Strategies for Polynomial Models

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Decision Support System on Prediction of Heart Disease Using Data Mining Techniques

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

MACHINE LEARNING BASICS WITH R

Transcription:

10 10 Overview Classification Techniques (1) Today Classification Problem Classification based on Regression Distance-based Classification (KNN) Net Lecture Decision Trees Classification using Rules Quality of Classifiers Data Mining Lecture 3: Classification 1 2 Classification Problem Given a database D = {t 1,t 2,,t n } and a finite set of classes C = {C 1,,C m }, the Classification Problem is to define a mapping f:d C where each t i is assigned to one class. Actually, f divides D into equivalence classes. Prediction is a similar process, but may be viewed as having an infinite number of classes. Data Mining Lecture 3: Classification 1 3 Classification: Definition Given a collection of records (training set) Each record contains a set of attributes, one of the attributes is the class. Find a model for class attribute as a function of the values of other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. Data Mining Lecture 3: Classification 1 4 Illustrating Classification Task Eamples of Classification Tasks Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No Learning algorithm Predicting tumor cells as benign or malignant 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes Induction Learn Model Classifying credit card transactions as legitimate or fraudulent 9 No Medium 75K No 10 No Small 90K Yes Training Set Attrib1 Attrib2 Attrib3 Tid Class 11 No Small 55K? Apply Model Model Classifying secondary structures of protein as alpha-heli, beta-sheet, or random coil 12 Yes Medium 80K? 13 Yes Large 110K? 14 No Small 95K? 15 No Large 67K? Test Set Deduction Categorizing news stories as finance, weather, entertainment, sports, etc Data Mining Lecture 3: Classification 1 5 Data Mining Lecture 3: Classification 1 6 1

More Classification Eamples Classification Eample: Grading Eams Classify students grades as VG, G, or U. Identify mushrooms as poisonous or edible. Classify stocks as buy, keep, or sell. Identify individuals with credit risks. Perform speech recognition. Perform pattern recognition. If >= 28 then grade = VG If 18 <= < 28 then grade = G If < 18 then grade = U <28 <18 U >=28 VG >=18 G Data Mining Lecture 3: Classification 1 7 Data Mining Lecture 3: Classification 1 8 Classification Eample: Letter Recognition View letters as constructed from 5 components: Letter A Letter B Classification Techniques Approach: Create specific model by evaluating training data (or by using knowledge from domain eperts). Apply model developed to new data. Classes must be predefined. Letter C Letter E Letter D Letter F Most common techniques use decision trees (DTs), neural networks (NNs), or are based on distances or statistical methods. Data Mining Lecture 3: Classification 1 9 Data Mining Lecture 3: Classification 1 10 Defining Classes Issues in Classification 10 5 Class B Class A Class C 0 1 2 3 4 5 6 7 8 Partitioning Based 10 5 Distance Based Class B Class A Class C 0 1 2 3 4 5 6 7 8 Missing Data Ignore Replace with assumed value Handling of Outliers in Training Data Measuring Performance Classification accuracy on test data Confusion matri Data Mining Lecture 3: Classification 1 11 Data Mining Lecture 3: Classification 1 12 2

Handling of Outliers in Training Data Height Eample Data? With Outliers Without Outliers Data Mining Lecture 3: Classification 1 13 Nam e Gender Height O utput1 O utput2 Kristina F 1.60 Short M edium Jim M 2.02 Tall M edium M aggie F 1.90 M edium Tall M artha F 1.88 M edium Tall Stephanie F 1.71 Short M edium Bob M 1.85 M edium M edium Kathy F 1.60 Short M edium Dave M 1.72 Short M edium W orth M 2.12 Tall Tall Steven M 2.10 Tall Tall Debbie F 1.78 M edium M edium Todd M 1.95 M edium M edium Kim F 1.89 M edium Tall Am y F 1.81 M edium M edium W ynette F 1.75 M edium M edium Data Mining Lecture 3: Classification 1 14 Classification Performance Confusion Matri Eample True Positive False Negative Using height data eample with Output1 correct and Output2 actual assignment Tall Tall Classified Tall Classified 20 10 Classified Tall 45 25 Classified A c tu a l A s s ig n m e n t M e m b e rs h ip S h o rt M e d iu m T a ll S h o rt 0 4 0 M e d iu m 0 5 3 T a ll 0 1 2 False Positive True Negative Data Mining Lecture 3: Classification 1 15 Data Mining Lecture 3: Classification 1 16 Regression Linear Regression Poor Fit Assume data fits a predefined function Determine best values for regression coefficients c 0,c 1,,c n. Assume an error: y = c 0 +c 1 1 + +c n n +ε Estimate error using mean squared error for training set: Data Mining Lecture 3: Classification 1 17 Data Mining Lecture 3: Classification 1 18 3

Classification Using Regression Division Division: Use regression function to divide area into regions. Prediction: Use regression function to predict a class membership function. Input includes desired class. Data Mining Lecture 3: Classification 1 19 Data Mining Lecture 3: Classification 1 20 Prediction Instance Based Classifiers Eamples: Rote-learner Memorizes entire training data and performs classification only if attributes of record match one of the training eamples eactly Nearest neighbor Uses k closest points (nearest neighbors) for performing classification Data Mining Lecture 3: Classification 1 21 Data Mining Lecture 3: Classification 1 22 Nearest Neighbor Classifiers Basic idea: If it walks like a duck, quacks like a duck, then it s probably a duck Training Records Compute Distance Choose k of the nearest records Test Record Classification Using Distance Measures Place items in the class to which they are closest. Must determine distance between an item and a class. Classes represented by Centroid: Central value. Medoid: Representative point. A set of individual points Algorithm: K-Nearest Neighbors (KNN) Data Mining Lecture 3: Classification 1 23 Data Mining Lecture 3: Classification 1 24 4

Nearest-Neighbor Classifiers Definition of Nearest Neighbor Unknown record Requires three things The set of stored records Distance Metric to compute distance between records The value of k, the number of nearest neighbors to retrieve X X X To classify an unknown record: Compute distance to other training records Identify k nearest neighbors Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote) (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor K-nearest neighbors of a record are data points that have the k smallest distance to Data Mining Lecture 3: Classification 1 25 Data Mining Lecture 3: Classification 1 26 1 Nearest Neighbor Voronoi Diagram Nearest Neighbor Classification Compute distance between two points: Euclidean distance d( p, q) = i ( p i q i ) 2 Determine the class from nearest neighbor list take the majority vote of class labels among the k- nearest neighbors Weigh the vote according to distance weight factor, w = 1/d 2 Data Mining Lecture 3: Classification 1 27 Data Mining Lecture 3: Classification 1 28 K Nearest Neighbors (KNN): Training set includes classes. Eamine K items near item to be classified. New item placed in class with the most number of close items. O(q) for each tuple to be classified. (Here q is the size of the training set.) Data Mining Lecture 3: Classification 1 29 KNN Algorithm Input D // training data N // neighbors t // tuple to classify Output c // class to which t gets classified KNN Algorithm N = ; for each d D do if N < K then N = N {d} ; else u = the item in N with ma distance (dissimilarity) from t ; if sim(t,u) < sim(t,d) then N = (N {u}) {d} ; c = the class to which most n N are classified ; Data Mining Lecture 3: Classification 1 30 5

KNN Algorithm Nearest Neighbor Classification: Issues? Choosing the value of k: If k is too small, sensitive to noise points If k is too large, neighborhood may include points from other classes X Data Mining Lecture 3: Classification 1 31 Data Mining Lecture 3: Classification 1 32 Nearest Neighbor Classification: More issues Scaling issues Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes Eample: height of a person may vary from 1.5m to 1.8m weight of a person may vary from 90lb to 300lb income of a person may vary from $10K to $1M Nearest Neighbor Classification: Wrap-up k-nn classifiers are lazy learners They do not build models eplicitly Unlike eager learners such as decision tree induction and rule-based systems Classifying unknown records is relatively epensive Data Mining Lecture 3: Classification 1 33 Data Mining Lecture 3: Classification 1 34 6