Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors



Transcription:

Classification: k-Nearest Neighbors
Data Mining
Dr. Engin YILDIZTEPE

Reference Books
- Han, J., Kamber, M., Pei, J. (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann Publishers.
- Larose, D. T. (2005). Discovering Knowledge in Data: An Introduction to Data Mining. New Jersey: John Wiley and Sons.
- Tan, P.-N., Steinbach, M., Kumar, V. (2006). Introduction to Data Mining. Addison-Wesley.
- Alpaydın, E. (2010). Introduction to Machine Learning. Second edition. London: MIT Press.

Supervised vs. Unsupervised Learning
Supervised learning
- Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
- New data is classified based on the training set.
Unsupervised learning
- The class labels of the training data are unknown.
- Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Classification: Definition
In classification, there is a target categorical variable (e.g., income bracket), which is partitioned into predetermined classes or categories, such as high income, middle income, and low income.

Classification: Definition
Training set: the set of tuples used for model construction.
- Given a collection of records, each record contains a set of attributes; one of the attributes is the class.
- Each tuple/record is assumed to belong to a predefined class, as determined by the class attribute.
Test set: the set of tuples used to determine the accuracy of the model.
- Usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
- The test set must be independent of the training set; otherwise the accuracy estimate is too optimistic and overfitting goes undetected.

Illustrating Classification Task
Training set (Learn Model):

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test set (Apply Model):

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Classification: Definition
Model construction: find a model for the class attribute as a function of the values of the other attributes. The model is represented as classification rules, decision trees, or mathematical formulae.
Goal: previously unseen or new records should be assigned a class as accurately as possible.
Accuracy rate: the percentage of test set samples that are correctly classified by the model. If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Process (1): Model Construction
Training Data -> Classification Algorithms -> Classifier (Model)

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model): IF rank = professor OR years > 6 THEN tenured = yes
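The division into training and test sets can be sketched as a simple hold-out split. This is only an illustration: the function name `holdout_split`, the 30% test fraction, and the fixed seed are assumptions, not something the slides prescribe. The records are the ten labelled rows of the training table above, with Attrib3 given in thousands.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and cut off a test portion.

    Hypothetical helper: the fraction and the seed are illustrative choices.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    # The first n_test shuffled records form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# (Tid, Attrib1, Attrib2, Attrib3 in K, Class) from the slide's labelled table.
records = [
    (1, "Yes", "Large", 125, "No"),
    (2, "No", "Medium", 100, "No"),
    (3, "No", "Small", 70, "No"),
    (4, "Yes", "Medium", 120, "No"),
    (5, "No", "Large", 95, "Yes"),
    (6, "No", "Medium", 60, "No"),
    (7, "Yes", "Large", 220, "No"),
    (8, "No", "Small", 85, "Yes"),
    (9, "No", "Medium", 75, "No"),
    (10, "No", "Small", 90, "Yes"),
]

train, test = holdout_split(records)
print(len(train), "training records,", len(test), "test records")
```

Because every record lands in exactly one of the two lists, the test set stays independent of the data used to build the model.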

Process (2): Using the Model in Prediction
Testing Data -> Classifier

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4). Tenured?

Prediction Problems: Classification vs. Numeric Prediction
Classification
- predicts categorical class labels (discrete or nominal)
- classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute and uses it in classifying new data
Numeric prediction
- models continuous-valued functions, i.e., predicts unknown or missing values

Classification Techniques
- Instance-based classifiers
- Decision tree based methods
- Neural networks
- Rule-based methods
- Memory based reasoning
- Bayes classification methods
- Support vector machines
- Other classification methods

Lazy Learning
Lazy learning (e.g., instance-based learning): simply stores training data (or only minor processing) and waits until it is given a test tuple.
Lazy: less time in training but more time in predicting.
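As a small worked example, the rule from the model construction step (IF rank = professor OR years > 6 THEN tenured = yes) can be applied to the four testing records and to the unseen record (Jeff, Professor, 4). The function names `predict_tenured` and `accuracy` are illustrative, not part of the slides.

```python
def predict_tenured(rank, years):
    """Rule from the slides: professor rank OR more than 6 years -> tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual TENURED label).
test_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    """Fraction of rows whose predicted label matches the actual label."""
    correct = sum(
        predict_tenured(rank, years) == label for _, rank, years, label in rows
    )
    return correct / len(rows)

print(accuracy(test_data))              # 0.75
print(predict_tenured("Professor", 4))  # yes
```

The rule misclassifies only Merlisa (7 years, but not tenured), so the accuracy rate on this test set is 3 out of 4. If that is judged acceptable, the rule labels Jeff as tenured.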

Instance-Based Methods (Lazy Learner)
Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
Typical approaches
- k-nearest neighbor approach: instances represented as points in a Euclidean space
- Locally weighted regression
- Case-based reasoning

K-Nearest Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.
Figure: compute the distance from the test record to the training records, then choose k of the nearest records.

K-Nearest-Neighbor Classifiers
Requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
- Compute its distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of k-Nearest Neighbor
Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor of a record X.
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
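The three classification steps (compute distances, identify the k nearest neighbors, take a majority vote) can be sketched in a few lines. This is a minimal sketch, assuming Euclidean distance and a toy two-class data set invented for illustration; the name `knn_classify` is hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """Classify `query` by majority vote among its k nearest training records.

    `training` is a list of (point, label) pairs; distance is Euclidean.
    """
    # Step 1 and 2: sort the stored records by distance and keep the k closest.
    neighbors = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    # Step 3: majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy data (an assumption for illustration, not from the slides).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 5.5), "B"), ((6.5, 7.0), "B"),
]

print(knn_classify(training, (2.0, 2.0), k=3))  # A
print(knn_classify(training, (6.0, 5.0), k=3))  # B
```

No model is built ahead of time: all the work happens inside `knn_classify` at query time, which is what makes the method a lazy learner. With an even k a vote can tie; this sketch then returns whichever tied label was counted first.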

Nearest Neighbor Classification
Compute the distance between two points, e.g. the Euclidean distance:

$$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$$

Determine the class from the nearest neighbor list: take the majority vote of class labels among the k nearest neighbors.

Nearest Neighbor Classification
Choosing the value of k:
- If k is too small, the classifier is sensitive to noise points.
- If k is too large, the neighborhood may include points from other classes.

Nearest Neighbor Classification
Scaling issues: attributes may have to be scaled (normalization) to prevent distance measures from being dominated by one of the attributes.
Example:
- the height of a person may vary from 1.5 m to 1.8 m
- the weight of a person may vary from 40 kg to 100 kg
- the income of a person may vary from $10K to $1M

Nearest Neighbor Classification
k-NN classifiers are lazy learners:
- They do not build models explicitly.
- They are not eager learners such as decision tree induction and rule-based systems.
- Classifying unknown records is relatively expensive.
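The scaling issue can be made concrete with the height, weight, and income ranges from the example. The sketch below uses min-max normalization, which is one common choice and an assumption here (the slides only say "normalization"); the three people and the name `min_max_scale` are invented for illustration.

```python
import math

def min_max_scale(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]

# Hypothetical people: (height in m, weight in kg, income in $).
a = (1.50, 40.0, 10_000.0)
b = (1.80, 100.0, 11_000.0)     # very different build, similar income
c = (1.52, 42.0, 1_000_000.0)   # similar build, very different income

raw_ab = math.dist(a, b)
raw_ac = math.dist(a, c)
sa, sb, sc = min_max_scale([a, b, c])
scaled_ab = math.dist(sa, sb)
scaled_ac = math.dist(sa, sc)

print(f"raw:    d(a,b)={raw_ab:.1f}  d(a,c)={raw_ac:.1f}")
print(f"scaled: d(a,b)={scaled_ab:.3f}  d(a,c)={scaled_ac:.3f}")
```

On the raw values the distance between a and b is almost exactly their income difference of 1000, so height and weight barely matter. After scaling, each attribute contributes on a comparable footing and the ordering of the two distances changes.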

Example

X1   X2   Class
5    4    +
4    7    +
1    6    +
2    7    +
2    4    +
2    2    +
1    6    +
4    1    +
6    1    +
4    1    +
10   10
5    8
10   5
8    4
8    6
5    8
4    5
7    4
6    9
7    7
5    8
10   6
10   6
10   4
8    3    ?